Abstract

The on-line shortest path problem is considered under various models of partial monitoring.
Given a weighted directed acyclic graph whose edge weights can change in an arbitrary
(adversarial) way, a decision maker has to choose in each round of a game a path between two
distinguished vertices such that the loss of the chosen path (defined as the sum of the weights of 
its composing edges) be as small as possible. In a setting generalizing the multi-armed bandit 
problem, after choosing a path, the decision maker learns only the weights of those edges that 
belong to the chosen path. For this problem, an algorithm is given whose average cumulative 
loss in n rounds exceeds that of the best path, matched off-line to the entire sequence of the 
edge weights, by a quantity that is proportional to $1/\sqrt{n}$ and depends only polynomially on the
number of edges of the graph. The algorithm can be implemented with linear complexity in the 
number of rounds n and in the number of edges. An extension to the so-called label efficient 
setting is also given, in which the decision maker is informed about the weights of the edges 
corresponding to the chosen path at a total of m < n time instances. Another extension is 
shown where the decision maker competes against a time-varying path, a generalization of the
problem of tracking the best expert. A version of the multi-armed bandit setting for shortest 
path is also discussed where the decision maker learns only the total weight of the chosen path 
but not the weights of the individual edges on the path. Applications to routing in packet 
switched networks along with simulation results are also presented. 

Index Terms: On-line learning, shortest path problem, multi-armed bandit problem. 
1 Introduction



In a sequential decision problem, a decision maker (or forecaster) performs a sequence of actions. 
After each action the decision maker suffers some loss, depending on the response (or state) of the 
environment, and its goal is to minimize its cumulative loss over a certain period of time. In the 
setting considered here, no probabilistic assumption is made on how the losses corresponding to 
different actions are generated. In particular, the losses may depend on the previous actions of the 
decision maker, whose goal is to perform well relative to a set of reference forecasters (the so-called 
"experts") for any possible behavior of the environment. More precisely, the aim of the decision 
maker is to achieve asymptotically the same average (per round) loss as the best expert. 

Research into this problem started in the 1950s (see, for example, Blackwell [5] and Hannan
[18] for some of the basic results) and gained new life in the 1990s following the work of Vovk [29],
Littlestone and Warmuth [24], and Cesa-Bianchi et al. [7]. These results show that for any bounded
loss function, if the decision maker has access to the past losses of all experts, then it is possible 
to construct on-line algorithms that perform, for any possible behavior of the environment, almost 
as well as the best of N experts. More precisely, the per round cumulative loss of these algorithms 
is at most as large as that of the best expert plus a quantity proportional to $\sqrt{\ln N/n}$ for any
bounded loss function, where n is the number of rounds in the decision game. The logarithmic 
dependence on the number of experts makes it possible to obtain meaningful bounds even if the 
pool of experts is very large. 

In certain situations the decision maker has only limited knowledge about the losses of all 
possible actions. For example, it is often natural to assume that the decision maker gets to know 
only the loss corresponding to the action it has made, and has no information about the loss it would 
have suffered had it made a different decision. This setup is referred to as the multi-armed bandit 
problem, and was considered, in the adversarial setting, by Auer et al. [1], who gave an algorithm
whose normalized regret (the difference of the algorithm's average loss and that of the best expert)
is upper bounded by a quantity which is proportional to $\sqrt{N \ln N/n}$. Note that, compared to the
full information case described above where the losses of all possible actions are revealed to the
decision maker, there is an extra $\sqrt{N}$ factor in the performance bound, which seriously limits the
usefulness of the bound if the number of experts is large. 

Another interesting example for the limited information case is the so-called label efficient 
decision problem (see Helmbold and Panizza [22]) in which it is too costly to observe the state of
the environment, and so the decision maker can query the losses of all possible actions for only a 
limited number of times. A recent result of Cesa-Bianchi, Lugosi, and Stoltz [9] shows that in this 
case, if the decision maker can query the losses m times during a period of length n, then it can 
achieve $O(\sqrt{\ln N/m})$ average excess loss relative to the best expert.

In many applications the set of experts has a certain structure that may be exploited to construct 
efficient on-line decision algorithms. The construction of such algorithms has been of great interest 
in computational learning theory. A partial list of works dealing with this problem includes Herbster 
and Warmuth [19], Vovk [30], Bousquet and Warmuth [6], Helmbold and Schapire [27], Takimoto
and Warmuth [28], Kalai and Vempala [23], Gyorgy et al. [13], [14], [15]. For a more complete survey,
we refer to Cesa-Bianchi and Lugosi [8, Chapter 5].

In this paper we study the on-line shortest path problem, a representative example of structured 
expert classes that has received attention in the literature for its many applications, including, 
among others, routing in communication networks; see, e.g., Takimoto and Warmuth [28], Awerbuch 
et al. [3], or Gyorgy and Ottucsak [17], and adaptive quantizer design in zero-delay lossy source
coding; see Gyorgy et al. [13], [14], [16]. In this problem, a weighted directed (acyclic) graph is given
whose edge weights can change in an arbitrary manner, and the decision maker has to pick in each 
round a path between two given vertices, such that the weight of this path (the sum of the weights 
of its composing edges) be as small as possible. 

Efficient solutions, with time and space complexity proportional to the number of edges rather 
than to the number of paths (the latter typically being exponential in the number of edges), have 
been given in the full information case, where in each round the weights of all the edges are revealed 
after a path has been chosen; see, for example, Mohri [26], Takimoto and Warmuth [28], Kalai and
Vempala [23], and Gyorgy et al. [15].

In the bandit setting only the weights of the edges or just the sum of the weights of the edges 
composing the chosen path are revealed to the decision maker. If one applies the general bandit 
algorithm of Auer et al. [1], the resulting bound will be too large to be of practical use because
of its square-root-type dependence on the number of paths $N$. On the other hand, using the
special graph structure in the problem, Awerbuch and Kleinberg [4] and McMahan and Blum [25]
managed to get rid of the exponential dependence on the number of edges in the performance bound.
They achieved this by extending the exponentially weighted average predictor and the
follow-the-perturbed-leader algorithm of Hannan [18] to the generalization of the multi-armed bandit
setting for shortest paths, when only the sum of the weights of the edges is available for the algorithm.
However, the dependence of the bounds obtained in [4] and [25] on the number of rounds $n$ is
significantly worse than the $O(1/\sqrt{n})$ bound of Auer et al. [1]. Awerbuch and Kleinberg [4] consider
the model of "non-oblivious" adversaries for shortest path (i.e., the losses assigned to the edges can
depend on the previous actions of the forecaster) and prove an $O(n^{-1/3})$ bound for the expected
per-round regret. McMahan and Blum [25] give a simpler algorithm than that of [4]; however, they
obtain a bound of the order of $O(n^{-1/4})$ for the expected regret.

In this paper we provide an extension of the bandit algorithm of Auer et al. [1] unifying the
advantages of the above approaches, with a performance bound that is polynomial in the number
of edges, and converges to zero at the right $O(1/\sqrt{n})$ rate as the number of rounds increases. We
achieve this bound in a model which assumes that the losses of all edges on the path chosen by the
forecaster are available separately after making the decision. We also discuss the case (considered
by [4] and [25]) in which only the total loss (i.e., the sum of the losses on the chosen path) is known
to the decision maker. We exhibit a simple algorithm which achieves an $O(n^{-1/3})$ per-round regret
with high probability against a "non-oblivious" adversary. In this case it remains an open problem
to find an algorithm whose per-round regret is polynomial in the number of edges of the graph and
decreases as $O(n^{-1/2})$ with the number of rounds.

In Section 2 we formally define the on-line shortest path problem, which is extended to the
multi-armed bandit setting in Section 3. Our new algorithm for the shortest path problem in
the bandit setting is given in Section 4 together with its performance analysis. The algorithm
is extended to solve the shortest path problem in a combined label efficient multi-armed bandit
setting in Section 5. Another extension, when the algorithm competes against a time-varying path,
is studied in Section 6. An algorithm for the "restricted" multi-armed bandit setting (when only the
sums of the losses of the edges are available) is given in Section 7. Simulation results are presented
in Section 8.
2 The shortest path problem



Consider a network represented by a set of vertices connected by edges, and assume that we have 
to send a stream of packets from a distinguished vertex, called source, to another distinguished 
vertex, called destination. At each time slot a packet is sent along a chosen route connecting source 
and destination. Depending on the traffic, each edge in the network may have a different delay, and 
the total delay the packet suffers on the chosen route is the sum of delays of the edges composing 
the route. The delays may change from one time slot to the next one in an arbitrary way, and 
our goal is to find a way of choosing the route in each time slot such that the sum of the total 
delays over time is not significantly more than that of the best fixed route in the network. This 
adversarial version of the routing problem is most useful when the delays on the edges can change 
dynamically, even depending on our previous routing decisions. This is the situation in the case 
of ad-hoc networks, where the network topology can change rapidly, or in certain secure networks, 
where the algorithm has to be prepared to handle denial of service attacks, that is, situations where 
willingly malfunctioning vertices and links increase the delay; see, e.g., Awerbuch et al. [3]. 

This problem can be cast naturally as a sequential decision problem in which each possible 
route is represented by an action. However, the number of routes is typically exponentially large 
in the number of edges, and therefore computationally efficient algorithms are called for. Two 
solutions of different flavor have been proposed. One of them is based on a
follow-the-perturbed-leader forecaster, see Kalai and Vempala [23], while the other is based on an efficient
computation of the exponentially weighted average forecaster, see, for example, Takimoto and Warmuth [28].
Both solutions have different advantages and may be generalized in different directions. 

To formalize the problem, consider a (finite) directed acyclic graph with a set of edges
$E = \{e_1, \ldots, e_{|E|}\}$ and a set of vertices $V$. Thus, each edge $e \in E$ is an ordered pair of vertices $(v_1, v_2)$.
Let $u$ and $v$ be two distinguished vertices in $V$. A path from $u$ to $v$ is a sequence of edges $e^{(1)}, \ldots, e^{(k)}$
such that $e^{(1)} = (u, v_1)$, $e^{(j)} = (v_{j-1}, v_j)$ for all $j = 2, \ldots, k-1$, and $e^{(k)} = (v_{k-1}, v)$. Let
$\mathcal{P} = \{i_1, \ldots, i_N\}$ denote the set of all such paths. For simplicity, we assume that every edge in
$E$ is on some path from $u$ to $v$ and every vertex in $V$ is an endpoint of an edge (see Figure 1 for
examples).

[Figure: panels (a) and (b), each showing a directed acyclic graph with source u and destination v.]
Figure 1: Two examples of directed acyclic graphs for the shortest path problem.



In each round $t = 1, \ldots, n$ of the decision game, the decision maker chooses a path $I_t$ among
all paths from $u$ to $v$. Then a loss $\ell_{e,t} \in [0, 1]$ is assigned to each edge $e \in E$. We write $e \in i$ if the
edge $e \in E$ belongs to the path $i \in \mathcal{P}$, and with a slight abuse of notation the loss of a path $i$ at
time slot $t$ is also represented by $\ell_{i,t}$. Then $\ell_{i,t}$ is given as

$$\ell_{i,t} = \sum_{e \in i} \ell_{e,t}$$

and therefore the cumulative loss up to time $t$ of each path $i$ takes the additive form

$$L_{i,t} = \sum_{s=1}^{t} \ell_{i,s} = \sum_{e \in i} \sum_{s=1}^{t} \ell_{e,s}$$

where the inner sum on the right-hand side is the loss accumulated by edge $e$ during the first $t$
rounds of the game. The cumulative loss of the algorithm is

$$\hat{L}_t = \sum_{s=1}^{t} \ell_{I_s,s} = \sum_{s=1}^{t} \sum_{e \in I_s} \ell_{e,s}.$$

It is well known that for a general loss sequence, the decision maker must be allowed to use
randomization to be able to approximate the performance of the best expert (see, e.g., Cesa-Bianchi
and Lugosi [8]). Therefore, the path $I_t$ is chosen randomly according to some distribution $p_t$ over
all paths from $u$ to $v$. We study the normalized regret over $n$ rounds of the game

$$\frac{1}{n}\left(\hat{L}_n - \min_{i \in \mathcal{P}} L_{i,n}\right)$$

where the minimum is taken over all paths $i$ from $u$ to $v$.
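
As a small illustration of these definitions, the following sketch computes path losses and the normalized regret on a toy instance. The graph, the loss values, and the function names are hypothetical and serve only to make the formulas concrete.

```python
def path_loss(path, edge_loss):
    """Loss of a path: the sum of the losses of its edges."""
    return sum(edge_loss[e] for e in path)


def normalized_regret(chosen, paths, losses):
    """(1/n) * (cumulative loss of the chosen paths - loss of the best fixed path)."""
    n = len(losses)
    algo = sum(path_loss(chosen[t], losses[t]) for t in range(n))
    best = min(sum(path_loss(p, losses[t]) for t in range(n)) for p in paths)
    return (algo - best) / n


# Hypothetical toy graph with two paths: u -> a -> v and u -> b -> v.
paths = [(("u", "a"), ("a", "v")), (("u", "b"), ("b", "v"))]
losses = [
    {("u", "a"): 0.1, ("a", "v"): 0.2, ("u", "b"): 0.5, ("b", "v"): 0.5},
    {("u", "a"): 0.9, ("a", "v"): 0.8, ("u", "b"): 0.1, ("b", "v"): 0.1},
]
chosen = [paths[1], paths[0]]  # the worse path in each of the two rounds
regret = normalized_regret(chosen, paths, losses)
print(regret)
```

Here the best fixed path is the second one, so the regret compares the two (bad) choices against always routing through b.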

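For example, the exponentially weighted average forecaster ([29], [23], [7]), calculated over all
possible paths, has regret

$$\frac{1}{n}\left(\hat{L}_n - \min_{i \in \mathcal{P}} L_{i,n}\right) \le K\left(\sqrt{\frac{\ln N}{2n}} + \sqrt{\frac{\ln(1/\delta)}{2n}}\right)$$

with probability at least $1 - \delta$, where $N$ is the total number of paths from $u$ to $v$ in the graph and
$K$ is the length of the longest path.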

3 The multi-armed bandit setting 

In this section we discuss the "bandit" version of the shortest path problem. In this setup, which is 
more realistic in many applications, the decision maker has only access to the losses corresponding 
to the paths it has chosen. For example, in the routing problem this means that information is 
available on the delay of the route the packet is sent on, and not on other routes in the network. 

We distinguish between two types of bandit problems, both of which are natural generalizations 
of the simple bandit problem to the shortest path problem. In the first variant, the decision maker 
has access to the losses of those edges that are on the path it has chosen. That is, after choosing a 
path $I_t$ at time $t$, the value of the loss $\ell_{e,t}$ is revealed to the decision maker if and only if $e \in I_t$.
We study this case and its extensions in Sections 4, 5, and 6.

The second variant is a more restricted version in which the loss of the chosen path is observed,
but no information is available on the individual losses of the edges belonging to the path. That
is, after choosing a path $I_t$ at time $t$, only the value of the loss of the path $\ell_{I_t,t}$ is revealed to the
decision maker. In what follows we call this setting the restricted bandit problem for shortest path.
We consider this restricted problem in Section 7.

Formally, the on-line shortest path problem in the multi-armed bandit setting is described as
follows: at each time instance $t = 1, \ldots, n$, the decision maker picks a path $I_t \in \mathcal{P}$ from $u$ to $v$.
Then the environment assigns loss $\ell_{e,t} \in [0, 1]$ to each edge $e \in E$, and the decision maker suffers
loss $\ell_{I_t,t} = \sum_{e \in I_t} \ell_{e,t}$. In the unrestricted case the losses $\ell_{e,t}$ are revealed for all $e \in I_t$, while in
the restricted case only $\ell_{I_t,t}$ is revealed. Note that in both cases $\ell_{e,t}$ may depend on $I_1, \ldots, I_{t-1}$,
the earlier choices of the decision maker.

For the basic multi-armed bandit problem, Auer et al. [1] gave an algorithm, based on exponential
weighting with a biased estimate of the gains (defined, in our case, as $g_{i,t} = K - \ell_{i,t}$), combined
with uniform exploration. Applying their algorithm to the on-line shortest path problem in the
bandit setting results in a performance that can be bounded, for any $0 < \delta < 1$ and fixed time
horizon $n$, with probability at least $1 - \delta$, by

$$\frac{1}{n}\left(\hat{L}_n - \min_{i \in \mathcal{P}} L_{i,n}\right) \le \frac{11K}{2}\sqrt{\frac{N \ln(N/\delta)}{n}} + \frac{K \ln N}{2n}.$$

(The constants follow from a slightly improved version; see Cesa-Bianchi and Lugosi [8].)

However, for the shortest path problem this bound is unacceptably large because, unlike in the 
full information case, here the dependence on the number of all paths N is not merely logarithmic, 
while N is typically exponentially large in the size of the graph (as in the two simple examples 
of Figure 1). In order to achieve a bound that does not grow exponentially with the number of
edges of the graph, it is imperative to make use of the dependence structure of the losses of the
different actions (i.e., paths). Awerbuch and Kleinberg [4] and McMahan and Blum [25] do this
by extending low complexity predictors, such as the follow-the-perturbed-leader forecaster [18],
[23], to the restricted bandit setting. However, in both cases the price to pay for the polynomial
dependence on the number of edges is a worse dependence on the length n of the game. 
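
To make the gap between the number of paths and the number of edges concrete, the sketch below counts source-to-destination paths in a layered graph by memoized recursion. The layered construction and the function names are illustrative assumptions, not taken from the paper's figures.

```python
from functools import lru_cache


def count_paths(edges, source, target):
    """Number of directed paths from source to target in a DAG."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)

    @lru_cache(maxsize=None)
    def count(x):
        if x == target:
            return 1
        return sum(count(y) for y in succ.get(x, []))

    return count(source)


def layered_graph(depth, width=2):
    """Hypothetical example: `depth` layers of `width` vertices between u and v,
    with consecutive layers fully connected."""
    layers = [["u"]] + [[(d, j) for j in range(width)] for d in range(depth)] + [["v"]]
    return [(a, b) for lo, hi in zip(layers, layers[1:]) for a in lo for b in hi]


for depth in (2, 5, 10):
    edges = layered_graph(depth)
    print(depth, len(edges), count_paths(edges, "u", "v"))
```

With width 2 the number of edges grows linearly in the depth while the number of paths doubles with every layer, so a bound involving $\sqrt{N}$ quickly becomes vacuous.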



4 A bandit algorithm for shortest paths 

In this section we describe a variant of the bandit algorithm of [1] which achieves the desired
performance for the shortest path problem. The new algorithm uses the fact that when the losses 
of the edges of the chosen path are revealed, then this also provides some information about the 
losses of each path sharing common edges with the chosen path. 

For each edge $e \in E$ and $t = 1, 2, \ldots$, introduce the gain $g_{e,t} = 1 - \ell_{e,t}$, and for each path $i \in \mathcal{P}$,
let the gain be the sum of the gains of the edges on the path, that is,

$$g_{i,t} = \sum_{e \in i} g_{e,t}.$$

The conversion from losses to gains is done in order to facilitate the subsequent performance 
analysis. To simplify the conversion, we assume that each path $i \in \mathcal{P}$ is of the same length $K$
for some $K > 0$. Note that although this assumption may seem restrictive at first
glance, from each acyclic directed graph $(V, E)$ one can construct a new graph by adding at most
$(K - 2)(|V| - 2) + 1$ vertices and edges (with constant weight zero) to the graph without modifying
the weights of the paths such that each path from $u$ to $v$ will be of length $K$, where $K$ denotes the
length of the longest path of the original graph. If the number of edges is quadratic in the number
of vertices, the size of the graph is not increased substantially.

A main feature of the algorithm below is that the gains are estimated for each edge and not 
for each path. This modification results in an improved upper bound on the performance with 
the number of edges in place of the number of paths. Moreover, using dynamic programming as 
in Takimoto and Warmuth [28], the algorithm can be computed efficiently. Another important 
ingredient of the algorithm is that one needs to make sure that every edge is sampled sufficiently 
often. To this end, we introduce a set $\mathcal{C}$ of covering paths with the property that for each edge
$e \in E$ there is a path $i \in \mathcal{C}$ such that $e \in i$. Observe that one can always find such a covering set
of cardinality $|\mathcal{C}| \le |E|$.
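
One simple way to obtain such a covering set is sketched below: for every edge not yet covered, join a path from the source to the tail of the edge, the edge itself, and a path from its head to the destination. This greedy construction is an illustrative assumption (the text only asserts that a covering set of size at most the number of edges exists), and it relies on the standing assumption that every edge lies on some path from u to v.

```python
from collections import deque


def covering_paths(edges, source, target):
    """Return source->target paths (tuples of edges) that together cover every edge."""
    succ, pred = {}, {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
        pred.setdefault(b, []).append(a)

    def search_tree(root, nbrs):
        # Breadth-first search; parent[x] is the vertex from which x was first reached.
        parent = {root: None}
        queue = deque([root])
        while queue:
            x = queue.popleft()
            for y in nbrs.get(x, []):
                if y not in parent:
                    parent[y] = x
                    queue.append(y)
        return parent

    from_source = search_tree(source, succ)  # parent pointers lead back to the source
    to_target = search_tree(target, pred)    # parent pointers lead on to the target

    cover, covered = [], set()
    for a, b in edges:
        if (a, b) in covered:
            continue
        head, x = [], a
        while from_source[x] is not None:     # walk back from a to the source
            head.append((from_source[x], x))
            x = from_source[x]
        head.reverse()
        tail, y = [], b
        while to_target[y] is not None:       # walk forward from b to the target
            tail.append((y, to_target[y]))
            y = to_target[y]
        path = tuple(head + [(a, b)] + tail)
        cover.append(path)
        covered.update(path)
    return cover


edges = [("u", "a"), ("u", "b"), ("a", "b"), ("a", "v"), ("b", "v")]
cover = covering_paths(edges, "u", "v")
print(cover)
```

Each new path covers at least one previously uncovered edge, which is why the resulting set never has more elements than there are edges.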

We note that the algorithm of [1] is a special case of the algorithm below: For any multi-armed
bandit problem with $N$ experts, one can define a graph with two vertices $u$ and $v$, and $N$
directed edges from $u$ to $v$ with weights corresponding to the losses of the experts. The solution
of the shortest path problem in this case is equivalent to that of the original bandit problem with
choosing expert $i$ if the corresponding edge is chosen. For this graph, our algorithm reduces to the
original algorithm of [1].



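A BANDIT ALGORITHM FOR SHORTEST PATHS

Parameters: real numbers $\beta > 0$, $0 < \eta, \gamma < 1$.

Initialization: Set $w_{e,0} = 1$ for each $e \in E$, $\bar{w}_{i,0} = 1$ for each $i \in \mathcal{P}$, and $\bar{W}_0 = N$.
For each round $t = 1, 2, \ldots$

(a) Choose a path $I_t$ at random according to the distribution $p_t$ on $\mathcal{P}$, defined by

$$p_{i,t} = \begin{cases} (1 - \gamma)\dfrac{\bar{w}_{i,t-1}}{\bar{W}_{t-1}} + \dfrac{\gamma}{|\mathcal{C}|} & \text{if } i \in \mathcal{C}, \\[2ex] (1 - \gamma)\dfrac{\bar{w}_{i,t-1}}{\bar{W}_{t-1}} & \text{if } i \notin \mathcal{C}. \end{cases}$$

(b) Compute the probability of choosing each edge $e$ as

$$q_{e,t} = \sum_{i : e \in i} p_{i,t} = (1 - \gamma)\frac{\sum_{i : e \in i} \bar{w}_{i,t-1}}{\bar{W}_{t-1}} + \gamma\,\frac{|\{i \in \mathcal{C} : e \in i\}|}{|\mathcal{C}|}.$$

(c) Calculate the estimated gains

$$g'_{e,t} = \begin{cases} \dfrac{g_{e,t} + \beta}{q_{e,t}} & \text{if } e \in I_t, \\[2ex] \dfrac{\beta}{q_{e,t}} & \text{otherwise.} \end{cases}$$

(d) Compute the updated weights

$$w_{e,t} = w_{e,t-1}\, e^{\eta g'_{e,t}}, \qquad \bar{w}_{i,t} = \prod_{e \in i} w_{e,t} = \bar{w}_{i,t-1}\, e^{\eta g'_{i,t}},$$

where $g'_{i,t} = \sum_{e \in i} g'_{e,t}$, and the sum of the total weights of the paths

$$\bar{W}_t = \sum_{i \in \mathcal{P}} \bar{w}_{i,t}.$$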
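
The four steps of a round can be sketched as follows for a graph small enough to list all paths explicitly. This brute-force enumeration is only meant to illustrate the update rules (mixing exponential weights with uniform exploration over a covering set, forming edge probabilities, adding the bias to the importance-weighted gains, and updating edge weights exponentially); it is not the efficient implementation, and the function name, argument names, and toy data are hypothetical.

```python
import math
import random


def bandit_round(paths, cover, w, gains, beta, eta, gamma, rng):
    """One round of steps (a)-(d); `w` maps each edge to its current weight."""
    # The weight of a path is the product of the weights of its edges.
    path_w = [math.prod(w[e] for e in p) for p in paths]
    total = sum(path_w)
    # (a) Exponential weights mixed with uniform exploration over the covering set.
    p = [(1 - gamma) * pw / total + (gamma / len(cover) if path in cover else 0.0)
         for path, pw in zip(paths, path_w)]
    chosen = rng.choices(paths, weights=p)[0]
    # (b) Probability that each edge lies on the chosen path.
    q = {e: sum(pi for path, pi in zip(paths, p) if e in path) for e in w}
    # (c) Estimated gains; only gains of edges on the chosen path are read.
    est = {e: ((gains[e] if e in chosen else 0.0) + beta) / q[e] for e in w}
    # (d) Exponential update of the edge weights.
    new_w = {e: w[e] * math.exp(eta * est[e]) for e in w}
    return chosen, p, q, est, new_w


# Hypothetical toy instance with two disjoint paths of length K = 2.
edges = [("u", "a"), ("a", "v"), ("u", "b"), ("b", "v")]
paths = [(("u", "a"), ("a", "v")), (("u", "b"), ("b", "v"))]
cover = paths
w = {e: 1.0 for e in edges}
gains = {("u", "a"): 0.9, ("a", "v"): 0.8, ("u", "b"): 0.5, ("b", "v"): 0.5}
chosen, p, q, est, w = bandit_round(paths, cover, w, gains, beta=0.05, eta=0.1,
                                    gamma=0.2, rng=random.Random(0))
print(chosen, p, q)
```

Because every edge belongs to some covering path, each edge probability is at least the exploration rate divided by the size of the covering set, which keeps the estimated gains bounded.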
The main result of the paper is the following performance bound for the shortest-path bandit
algorithm. It states that the per round regret of the algorithm, after $n$ rounds of play, is, roughly,
of the order of $K\sqrt{|E| \ln N/n}$ where $|E|$ is the number of edges of the graph, $K$ is the length of the
paths, and $N$ is the total number of paths.

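Theorem 1 For any $\delta \in (0, 1)$ and parameters $0 < \gamma < 1/2$, $0 < \beta \le 1$, and $\eta > 0$ satisfying
$2\eta K|\mathcal{C}| \le \gamma$, the performance of the algorithm defined above can be bounded, with probability at
least $1 - \delta$, as

$$\frac{1}{n}\left(\hat{L}_n - \min_{i \in \mathcal{P}} L_{i,n}\right) \le K\gamma + 2\eta K^2 |\mathcal{C}| + \frac{K}{n\beta}\ln\frac{|E|}{\delta} + \frac{\ln N}{n\eta} + |E|\beta.$$

In particular, choosing $\beta = \sqrt{\frac{K}{n|E|}\ln\frac{|E|}{\delta}}$, $\gamma = 2\eta K|\mathcal{C}|$, and $\eta = \sqrt{\frac{\ln N}{4nK^2|\mathcal{C}|}}$ yields, for all
$n \ge \max\left\{\frac{K}{|E|}\ln\frac{|E|}{\delta},\; 4|\mathcal{C}|\ln N\right\}$,

$$\frac{1}{n}\left(\hat{L}_n - \min_{i \in \mathcal{P}} L_{i,n}\right) \le 2\sqrt{\frac{K}{n}}\left(\sqrt{4K|\mathcal{C}|\ln N} + \sqrt{|E|\ln\frac{|E|}{\delta}}\right).$$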
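
The parameter choices in the theorem are plain arithmetic, and the following sketch evaluates them together with the resulting bound. The function name, argument names, and the numerical values in the example are assumptions made for illustration.

```python
import math


def tune_parameters(n, K, num_edges, num_cover, num_paths, delta):
    """Parameter choices of Theorem 1 and the resulting per-round regret bound."""
    log_e = math.log(num_edges / delta)
    log_n = math.log(num_paths)
    beta = math.sqrt(K / (n * num_edges) * log_e)
    eta = math.sqrt(log_n / (4 * n * K**2 * num_cover))
    gamma = 2 * eta * K * num_cover
    # Smallest horizon for which the theorem's choices are admissible.
    n_min = max(K / num_edges * log_e, 4 * num_cover * log_n)
    bound = 2 * math.sqrt(K / n) * (math.sqrt(4 * K * num_cover * log_n)
                                    + math.sqrt(num_edges * log_e))
    return beta, gamma, eta, n_min, bound


# Hypothetical instance: 20 edges, paths of length 5, 100 paths, 10 covering paths.
n, K, E, C, N, delta = 10_000, 5, 20, 10, 100, 0.05
beta, gamma, eta, n_min, bound = tune_parameters(n, K, E, C, N, delta)
# The general five-term bound of the theorem, evaluated at the tuned parameters.
general = (K * gamma + 2 * eta * K**2 * C + K / (n * beta) * math.log(E / delta)
           + math.log(N) / (n * eta) + E * beta)
print(beta, gamma, eta, n_min, bound, general)
```

Evaluating the five-term bound at the tuned parameters reproduces the closed-form expression, which is a quick sanity check on the algebra.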



The proof of the theorem is based on the analysis of the original algorithm of [1] with necessary
modifications required to transform parts of the argument from paths to edges, and to use the 
connection between the gains of paths sharing common edges. 

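For the analysis we introduce some notation:

$$G_{i,n} = \sum_{t=1}^{n} g_{i,t} \quad \text{and} \quad G'_{i,n} = \sum_{t=1}^{n} g'_{i,t}$$

for each $i \in \mathcal{P}$,

$$G_{e,n} = \sum_{t=1}^{n} g_{e,t} \quad \text{and} \quad G'_{e,n} = \sum_{t=1}^{n} g'_{e,t}$$

for each $e \in E$, and

$$\hat{G}_n = \sum_{t=1}^{n} g_{I_t,t}.$$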

The following lemma shows that the deviation of the true cumulative gain from the estimated
cumulative gain is of the order of $\sqrt{n}$. The proof is a modification of [8, Lemma 6.7].



Lemma 1 For any $\delta \in (0, 1)$, $0 < \beta \le 1$ and $e \in E$ we have

$$\mathbb{P}\left[G_{e,n} > G'_{e,n} + \frac{1}{\beta}\ln\frac{|E|}{\delta}\right] \le \frac{\delta}{|E|}.$$



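Proof. Fix $e \in E$. For any $u > 0$ and $c > 0$, by the Chernoff bound we have

$$\mathbb{P}\left[G_{e,n} > G'_{e,n} + u\right] \le e^{-cu}\,\mathbb{E}\,e^{c(G_{e,n} - G'_{e,n})}. \tag{1}$$

Letting $u = \ln(|E|/\delta)/\beta$ and $c = \beta$, we get

$$e^{-cu}\,\mathbb{E}\,e^{c(G_{e,n} - G'_{e,n})} = e^{-\ln(|E|/\delta)}\,\mathbb{E}\,e^{\beta(G_{e,n} - G'_{e,n})} = \frac{\delta}{|E|}\,\mathbb{E}\,e^{\beta(G_{e,n} - G'_{e,n})},$$

so it suffices to prove that $\mathbb{E}\,e^{\beta(G_{e,n} - G'_{e,n})} \le 1$ for all $n$. To this end, introduce

$$Z_t = e^{\beta(G_{e,t} - G'_{e,t})}.$$

Below we show that $\mathbb{E}_t[Z_t] \le Z_{t-1}$ for $t \ge 2$, where $\mathbb{E}_t$ denotes the conditional expectation
$\mathbb{E}[\,\cdot \mid I_1, \ldots, I_{t-1}]$. Clearly,

$$Z_t = Z_{t-1}\exp\left(\beta\left(g_{e,t} - \frac{\mathbb{1}_{\{e \in I_t\}}\, g_{e,t} + \beta}{q_{e,t}}\right)\right).$$

Taking conditional expectations, we obtain

$$\begin{aligned}
\mathbb{E}_t[Z_t] &= Z_{t-1}\, e^{-\beta^2/q_{e,t}}\, \mathbb{E}_t\left[\exp\left(\beta\left(g_{e,t} - \frac{\mathbb{1}_{\{e \in I_t\}}\, g_{e,t}}{q_{e,t}}\right)\right)\right] \\
&\le Z_{t-1}\, e^{-\beta^2/q_{e,t}}\, \mathbb{E}_t\left[1 + \beta\left(g_{e,t} - \frac{\mathbb{1}_{\{e \in I_t\}}\, g_{e,t}}{q_{e,t}}\right) + \beta^2\left(g_{e,t} - \frac{\mathbb{1}_{\{e \in I_t\}}\, g_{e,t}}{q_{e,t}}\right)^2\right] && (2) \\
&\le Z_{t-1}\, e^{-\beta^2/q_{e,t}}\, \mathbb{E}_t\left[1 + \beta^2\left(\frac{\mathbb{1}_{\{e \in I_t\}}\, g_{e,t}}{q_{e,t}}\right)^2\right] && (3) \\
&\le Z_{t-1}\, e^{-\beta^2/q_{e,t}}\left(1 + \frac{\beta^2}{q_{e,t}}\right) \\
&\le Z_{t-1}. && (4)
\end{aligned}$$

Here (2) holds since $\beta \le 1$, $g_{e,t} - \frac{\mathbb{1}_{\{e \in I_t\}}\, g_{e,t}}{q_{e,t}} \le 1$ and $e^x \le 1 + x + x^2$ for $x \le 1$. (3) follows from
$\mathbb{E}_t\left[\frac{\mathbb{1}_{\{e \in I_t\}}\, g_{e,t}}{q_{e,t}}\right] = g_{e,t}$. Finally, (4) holds by the inequality $1 + x \le e^x$. Taking expectations on both
sides proves $\mathbb{E}[Z_t] \le \mathbb{E}[Z_{t-1}]$. A similar argument shows that $\mathbb{E}[Z_1] \le 1$, implying $\mathbb{E}[Z_n] \le 1$ as
desired. □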



Proof of Theorem 1. As usual in the analysis of exponentially weighted average forecasters, we
start with bounding the quantity $\ln(\bar{W}_n/\bar{W}_0)$. On the one hand, we have the lower bound

$$\ln\frac{\bar{W}_n}{\bar{W}_0} = \ln\sum_{i \in \mathcal{P}} e^{\eta G'_{i,n}} - \ln N \ge \eta\max_{i \in \mathcal{P}} G'_{i,n} - \ln N. \tag{5}$$
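To derive a suitable upper bound, first notice that the condition $\eta \le \frac{\gamma}{2K|\mathcal{C}|}$ implies $\eta g'_{i,t} \le 1$ for
all $i$ and $t$, since

$$\eta g'_{i,t} = \eta\sum_{e \in i} g'_{e,t} \le \eta\sum_{e \in i}\frac{1 + \beta}{q_{e,t}} \le \frac{\eta K(1 + \beta)|\mathcal{C}|}{\gamma} \le 1$$

where the second inequality follows because $q_{e,t} \ge \gamma/|\mathcal{C}|$ for each $e \in E$.

Therefore, using the fact that $e^x \le 1 + x + x^2$ for all $x \le 1$, for all $t = 1, 2, \ldots$ we have

$$\begin{aligned}
\ln\frac{\bar{W}_t}{\bar{W}_{t-1}} &= \ln\sum_{i \in \mathcal{P}}\frac{\bar{w}_{i,t-1}}{\bar{W}_{t-1}}\, e^{\eta g'_{i,t}} \\
&= \ln\left(\sum_{i \in \mathcal{P}}\frac{p_{i,t} - \frac{\gamma}{|\mathcal{C}|}\mathbb{1}_{\{i \in \mathcal{C}\}}}{1 - \gamma}\, e^{\eta g'_{i,t}}\right) && (6) \\
&\le \ln\left(\sum_{i \in \mathcal{P}}\frac{p_{i,t} - \frac{\gamma}{|\mathcal{C}|}\mathbb{1}_{\{i \in \mathcal{C}\}}}{1 - \gamma}\left(1 + \eta g'_{i,t} + \eta^2 g'^{\,2}_{i,t}\right)\right) \\
&\le \ln\left(1 + \sum_{i \in \mathcal{P}}\frac{p_{i,t}}{1 - \gamma}\left(\eta g'_{i,t} + \eta^2 g'^{\,2}_{i,t}\right)\right) \\
&\le \frac{\eta}{1 - \gamma}\sum_{i \in \mathcal{P}} p_{i,t}\, g'_{i,t} + \frac{\eta^2}{1 - \gamma}\sum_{i \in \mathcal{P}} p_{i,t}\, g'^{\,2}_{i,t} && (7)
\end{aligned}$$

where (6) follows from the definition of $p_{i,t}$, and (7) holds by the inequality $\ln(1 + x) \le x$ for all
$x > -1$.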

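Next we bound the sums in (7). On the one hand,

$$\sum_{i \in \mathcal{P}} p_{i,t}\, g'_{i,t} = \sum_{i \in \mathcal{P}} p_{i,t}\sum_{e \in i} g'_{e,t} = \sum_{e \in E} g'_{e,t}\sum_{i \in \mathcal{P} : e \in i} p_{i,t} = \sum_{e \in E} g'_{e,t}\, q_{e,t} = g_{I_t,t} + |E|\beta.$$

On the other hand,

$$\begin{aligned}
\sum_{i \in \mathcal{P}} p_{i,t}\, g'^{\,2}_{i,t} &= \sum_{i \in \mathcal{P}} p_{i,t}\left(\sum_{e \in i} g'_{e,t}\right)^2 \\
&\le \sum_{i \in \mathcal{P}} p_{i,t}\, K\sum_{e \in i} g'^{\,2}_{e,t} \\
&= K\sum_{e \in E} g'^{\,2}_{e,t}\sum_{i \in \mathcal{P} : e \in i} p_{i,t} \\
&= K\sum_{e \in E} g'^{\,2}_{e,t}\, q_{e,t} \\
&\le K(1 + \beta)\sum_{e \in E} g'_{e,t}
\end{aligned}$$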
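where the first inequality is due to the inequality between the arithmetic and quadratic mean, and
the second one holds because $g_{e,t} \le 1$. Therefore,

$$\ln\frac{\bar{W}_t}{\bar{W}_{t-1}} \le \frac{\eta}{1 - \gamma}\left(g_{I_t,t} + |E|\beta\right) + \frac{\eta^2 K(1 + \beta)}{1 - \gamma}\sum_{e \in E} g'_{e,t}.$$

Summing for $t = 1, \ldots, n$, we obtain

$$\begin{aligned}
\ln\frac{\bar{W}_n}{\bar{W}_0} &\le \frac{\eta}{1 - \gamma}\left(\hat{G}_n + n|E|\beta\right) + \frac{\eta^2 K(1 + \beta)}{1 - \gamma}\sum_{e \in E} G'_{e,n} \\
&\le \frac{\eta}{1 - \gamma}\left(\hat{G}_n + n|E|\beta\right) + \frac{\eta^2 K(1 + \beta)}{1 - \gamma}|\mathcal{C}|\max_{i \in \mathcal{P}} G'_{i,n} && (8)
\end{aligned}$$

where the second inequality follows since $\sum_{e \in E} G'_{e,n} \le \sum_{i \in \mathcal{C}} G'_{i,n}$. Combining the upper bound
with the lower bound (5), we obtain

$$\hat{G}_n \ge \left(1 - \gamma - \eta K(1 + \beta)|\mathcal{C}|\right)\max_{i \in \mathcal{P}} G'_{i,n} - \frac{1 - \gamma}{\eta}\ln N - n|E|\beta. \tag{9}$$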

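Now using Lemma 1 and applying the union bound, for any $\delta \in (0, 1)$ we have that, with probability
at least $1 - \delta$,

$$\hat{G}_n \ge \left(1 - \gamma - \eta K(1 + \beta)|\mathcal{C}|\right)\left(\max_{i \in \mathcal{P}} G_{i,n} - \frac{K}{\beta}\ln\frac{|E|}{\delta}\right) - \frac{1 - \gamma}{\eta}\ln N - n|E|\beta,$$

where we used $1 - \gamma - \eta K(1 + \beta)|\mathcal{C}| \ge 0$, which follows from the assumptions of the theorem.
Since $\hat{G}_n = Kn - \hat{L}_n$ and $G_{i,n} = Kn - L_{i,n}$ for all $i \in \mathcal{P}$, we have

$$\begin{aligned}
\hat{L}_n \le{}& Kn\left(\gamma + \eta(1 + \beta)K|\mathcal{C}|\right) + \left(1 - \gamma - \eta(1 + \beta)K|\mathcal{C}|\right)\min_{i \in \mathcal{P}} L_{i,n} \\
&+ \left(1 - \gamma - \eta(1 + \beta)K|\mathcal{C}|\right)\frac{K}{\beta}\ln\frac{|E|}{\delta} + \frac{1 - \gamma}{\eta}\ln N + n|E|\beta
\end{aligned}$$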

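with probability at least $1 - \delta$. This implies

$$\begin{aligned}
\hat{L}_n - \min_{i \in \mathcal{P}} L_{i,n} &\le Kn\gamma + \eta(1 + \beta)nK^2|\mathcal{C}| + \frac{K}{\beta}\ln\frac{|E|}{\delta} + \frac{1 - \gamma}{\eta}\ln N + n|E|\beta \\
&\le Kn\gamma + 2\eta nK^2|\mathcal{C}| + \frac{K}{\beta}\ln\frac{|E|}{\delta} + \frac{\ln N}{\eta} + n|E|\beta
\end{aligned}$$

with probability at least $1 - \delta$, which is the first statement of the theorem. Setting

$$\beta = \sqrt{\frac{K}{n|E|}\ln\frac{|E|}{\delta}} \quad \text{and} \quad \gamma = 2\eta K|\mathcal{C}|$$

results in the inequality

$$\hat{L}_n - \min_{i \in \mathcal{P}} L_{i,n} \le 4\eta nK^2|\mathcal{C}| + \frac{\ln N}{\eta} + 2\sqrt{nK|E|\ln\frac{|E|}{\delta}}$$

which holds with probability at least $1 - \delta$ if $n \ge (K/|E|)\ln(|E|/\delta)$ (to ensure $\beta \le 1$). Finally,
setting

$$\eta = \sqrt{\frac{\ln N}{4nK^2|\mathcal{C}|}}$$

yields the last statement of the theorem ($n \ge 4|\mathcal{C}|\ln N$ is required to ensure $\gamma \le 1/2$). □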



Next we analyze the computational complexity of the algorithm. The next result shows that 
the algorithm is feasible as its complexity is linear in the size (number of edges) of the graph. 

Theorem 2 The proposed algorithm can be implemented efficiently with time complexity $O(n|E|)$
and space complexity $O(|E|)$.



Proof. The two complex steps of the algorithm are steps (a) and (b), both of which can be
computed, similarly to Takimoto and Warmuth [28], using dynamic programming. To perform
these steps efficiently, first we order the vertices of the graph. Since we have an acyclic directed
graph, its vertices can be labeled (in $O(|E|)$ time) from 1 to $|V|$ such that $u = 1$, $v = |V|$, and if
$(v_1, v_2) \in E$, then $v_1 < v_2$. For any pair of vertices $u_1 < v_1$ let $\mathcal{P}_{u_1,v_1}$ denote the set of paths from
$u_1$ to $v_1$, and for any vertex $s \in V$, let

$$H_t(s) = \sum_{i \in \mathcal{P}_{s,v}}\prod_{e \in i} w_{e,t}$$

and

$$\hat{H}_t(s) = \sum_{i \in \mathcal{P}_{u,s}}\prod_{e \in i} w_{e,t}.$$

Given the edge weights $\{w_{e,t}\}$, $H_t(s)$ can be computed recursively for $s = |V| - 1, \ldots, 1$, and $\hat{H}_t(s)$
can be computed recursively for $s = 2, \ldots, |V|$ in $O(|E|)$ time (letting $H_t(v) = \hat{H}_t(u) = 1$ by
definition). In step (a), one first decides whether $I_t$ is generated according to the graph weights
(with probability $1 - \gamma$) or chosen uniformly from $\mathcal{C}$ (with probability $\gamma$). If $I_t$ is to be drawn
according to the graph weights, it can be shown that its vertices can be chosen one by one such that
if the first $k$ vertices of $I_t$ are $v_0 = u, v_1, \ldots, v_{k-1}$, then the next vertex of $I_t$ can be chosen to be
any $v_k > v_{k-1}$ satisfying $(v_{k-1}, v_k) \in E$, with probability
$w_{(v_{k-1},v_k),t-1}\, H_{t-1}(v_k)/H_{t-1}(v_{k-1})$. The other computationally
demanding step, namely step (b), can be performed easily by noting that for any edge $(v_1, v_2)$,

$$q_{(v_1,v_2),t} = (1 - \gamma)\frac{\hat{H}_{t-1}(v_1)\, w_{(v_1,v_2),t-1}\, H_{t-1}(v_2)}{H_{t-1}(u)} + \gamma\,\frac{|\{i \in \mathcal{C} : (v_1, v_2) \in i\}|}{|\mathcal{C}|}$$

as desired. □
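
The dynamic programming argument of the proof can be sketched as follows. The code assumes vertices are already labeled 1 to the number of vertices in topological order, stores edge weights in a dictionary, and sorts the edges for brevity (a bucketed traversal would avoid the logarithmic factor); all names and the toy weights are hypothetical. It covers only the part of the distribution generated by the graph weights, without the uniform exploration term.

```python
import random


def suffix_prefix_weights(num_vertices, w):
    """w maps edges (a, b), a < b, to weights; vertex 1 is u, vertex num_vertices is v."""
    H = [0.0] * (num_vertices + 1)      # H[s]: total weight of all paths s -> v
    H_hat = [0.0] * (num_vertices + 1)  # H_hat[s]: total weight of all paths u -> s
    H[num_vertices] = 1.0
    H_hat[1] = 1.0
    # Decreasing tail order guarantees H[b] is final before it is used.
    for (a, b), wt in sorted(w.items(), key=lambda kv: kv[0][0], reverse=True):
        H[a] += wt * H[b]
    # Increasing head order guarantees H_hat[a] is final before it is used.
    for (a, b), wt in sorted(w.items(), key=lambda kv: kv[0][1]):
        H_hat[b] += wt * H_hat[a]
    return H, H_hat


def sample_path(num_vertices, w, H, rng):
    """Draw a path with probability proportional to the product of its edge weights."""
    out = {}
    for (a, b), wt in w.items():
        out.setdefault(a, []).append((b, wt))
    path, s = [], 1
    while s != num_vertices:
        nxt, probs = zip(*[(b, wt * H[b] / H[s]) for b, wt in out[s]])
        b = rng.choices(nxt, weights=probs)[0]
        path.append((s, b))
        s = b
    return tuple(path)


def edge_marginals(w, H, H_hat):
    """Probability that each edge lies on a path drawn as in sample_path."""
    return {(a, b): H_hat[a] * wt * H[b] / H[1] for (a, b), wt in w.items()}


# Hypothetical weights on a 4-vertex DAG with paths 1-2-4, 1-3-4 and 1-2-3-4.
w = {(1, 2): 2.0, (1, 3): 1.0, (2, 3): 0.5, (2, 4): 1.0, (3, 4): 3.0}
H, H_hat = suffix_prefix_weights(4, w)
marg = edge_marginals(w, H, H_hat)
path = sample_path(4, w, H, random.Random(1))
print(H[1], marg, path)
```

In this example the three paths have weights 2, 3 and 3, so the total weight is 8 and, for instance, the edge (2, 3) is used with probability 3/8.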
5 A combination of the label efficient and bandit settings



In this section we investigate a combination of the multi-armed bandit and the label efficient
problems. This means that the decision maker only has access to the loss of the chosen path upon
request and the total number of requests must be bounded by a constant $m$. This combination is
motivated by some applications, in which feedback information is costly to obtain.

In the general label efficient decision problem, after taking an action, the decision maker has the
option to query the losses of all possible actions. For this problem, Cesa-Bianchi et al. [9] proved an
upper bound on the normalized regret of order $O(K\sqrt{\ln(4N/\delta)/m})$ which holds with probability
at least $1 - \delta$.

Our model of the label-efficient bandit problem for shortest paths is motivated by an application
to a particular packet switched network model. This model, called the cognitive packet network,
was introduced by Gelenbe et al. [11], [12]. In these networks a particular type of packets, called
smart packets, are used to explore the network (e.g., the delay of the chosen path). These packets
do not carry any useful data; they are merely used for exploring the network. The other type of
packets are the data packets, which do not collect any information about their paths. The task of
the decision maker is to send packets from the source to the destination over routes with minimum
average transmission delay (or packet loss). In this scenario, smart packets are used to query the
delay (or loss) of the chosen path. However, as these packets do not transport information, there
is a tradeoff between the number of queries and the usage of the network. If data packets are on
the average $\alpha$ times larger than smart packets (note that typically $\alpha > 1$) and $\varepsilon$ is the proportion
of time instances when smart packets are used to explore the network, then $\varepsilon/(\varepsilon + \alpha(1 - \varepsilon))$ is the
proportion of the bandwidth sacrificed for well informed routing decisions.
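
For a feel of the numbers, the bandwidth fraction above can be evaluated directly. The function name and the sample values below are illustrative assumptions.

```python
def exploration_bandwidth_fraction(eps, alpha):
    """Share of bandwidth used by smart packets when a fraction `eps` of the time
    slots carries a smart packet and data packets are `alpha` times larger."""
    return eps / (eps + alpha * (1 - eps))


# Hypothetical values: smart packets in 10% of the slots, data packets 10x larger.
fraction = exploration_bandwidth_fraction(0.1, 10.0)
print(fraction)
```

With these values only about 1.1% of the bandwidth is spent on exploration even though every tenth slot carries a smart packet.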

We study a combined algorithm which, at each time slot $t$, queries the loss of the chosen path
with probability $\varepsilon$ (as in the solution of the label efficient problem proposed in [9]), and, similarly
to the multi-armed bandit case, computes biased estimates $g'_{i,t}$ of the true gains $g_{i,t}$. Just as in the
previous section, it is assumed that each path of the graph is of the same length $K$.

The algorithm differs from our bandit algorithm of the previous section only in step (c), which
is modified in the spirit of [9]. The modified step is given below:



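MODIFIED STEP FOR THE LABEL EFFICIENT BANDIT ALGORITHM FOR
SHORTEST PATHS

(c') Draw a Bernoulli random variable $S_t$ with $\mathbb{P}(S_t = 1) = \varepsilon$, and compute the
estimated gains

$$g'_{e,t} = \begin{cases} \dfrac{g_{e,t} + \beta}{\varepsilon\, q_{e,t}}\, S_t & \text{if } e \in I_t, \\[2ex] \dfrac{\beta}{\varepsilon\, q_{e,t}}\, S_t & \text{if } e \notin I_t. \end{cases}$$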
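
The modified estimate can be checked numerically: averaging over the query indicator and over whether the edge lies on the chosen path, the estimate equals the true gain plus a positive bias. The sketch below does this exact averaging under the assumption that the query indicator is drawn independently of the path; the function names and the sample values are hypothetical.

```python
def label_efficient_estimate(gain, on_path, queried, q, beta, eps):
    """Estimated gain of one edge in step (c'); zero when no query is made."""
    if not queried:
        return 0.0
    return ((gain if on_path else 0.0) + beta) / (eps * q)


def expected_estimate(gain, q, beta, eps):
    """Exact expectation over the query indicator (probability eps) and the event
    that the edge is on the chosen path (probability q), assumed independent."""
    total = 0.0
    for queried, p_s in ((True, eps), (False, 1 - eps)):
        for on_path, p_i in ((True, q), (False, 1 - q)):
            total += p_s * p_i * label_efficient_estimate(
                gain, on_path, queried, q, beta, eps)
    return total


# Hypothetical values for a single edge.
mean = expected_estimate(gain=0.7, q=0.25, beta=0.05, eps=0.2)
print(mean)
```

The expectation is the true gain 0.7 plus the bias 0.05 / 0.25, so the estimate is optimistic on average, as in the plain bandit algorithm.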



The performance of the algorithm is analyzed in the next theorem, which can be viewed as a
combination of Theorem 1 in the preceding section and Theorem 2 of [9].
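Theorem 3 For any $\delta \in (0, 1)$, $\varepsilon \in (0, 1]$, parameters $\eta = \sqrt{\frac{\varepsilon\ln N}{4nK^2|\mathcal{C}|}}$, $\gamma = \frac{2\eta K|\mathcal{C}|}{\varepsilon} \le 1/2$, and
$\beta = \sqrt{\frac{K}{n|E|\varepsilon}\ln\frac{2|E|}{\delta}} \le 1$, and for all

$$n \ge \frac{1}{\varepsilon}\max\left\{\frac{K^2\ln^2(2|E|/\delta)}{|E|\ln N},\; \frac{|E|\ln(2|E|/\delta)}{K},\; 4|\mathcal{C}|\ln N\right\}$$

the performance of the algorithm defined above can be bounded, with probability at least $1 - \delta$, as

$$\begin{aligned}
\frac{1}{n}\left(\hat{L}_n - \min_{i \in \mathcal{P}} L_{i,n}\right) &\le \sqrt{\frac{K}{n\varepsilon}}\left(4\sqrt{K|\mathcal{C}|\ln N} + 5\sqrt{|E|\ln\frac{2|E|}{\delta}} + \sqrt{8K\ln\frac{2}{\delta}}\right) + \frac{4K}{3n\varepsilon}\ln\frac{2N}{\delta} \\
&\le \frac{27K}{2}\sqrt{\frac{|E|\ln\frac{2N}{\delta}}{n\varepsilon}}.
\end{aligned}$$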

If $\varepsilon$ is chosen as $(m - \sqrt{2m\ln(1/\delta)})/n$ then, with probability at least $1 - \delta$, the total number
of queries is bounded by $m$ (see [8, Lemma 6.1]) and the performance bound above is of the order
of $K\sqrt{|E|\ln(N/\delta)/m}$.
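
The choice of the query probability for a given query budget is a one-line computation, sketched below. The function name, the validity check, and the sample values are assumptions added for illustration.

```python
import math


def query_probability(m, n, delta):
    """Query probability (m - sqrt(2 m ln(1/delta))) / n for a budget of m queries
    in n rounds and confidence parameter delta."""
    eps = (m - math.sqrt(2 * m * math.log(1 / delta))) / n
    if not 0 < eps <= 1:
        raise ValueError("budget m is too small (or too large) for this n and delta")
    return eps


# Hypothetical budget: at most 1000 queries in 10000 rounds.
eps = query_probability(1000, 10_000, 0.05)
print(eps, eps * 10_000)
```

The probability is set slightly below m/n, so that the random number of queries stays within the budget with the stated confidence.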

Similarly to Theorem 1, we need a lemma which reveals the connection between the true and the
estimated cumulative losses. However, here we need a more careful analysis because the "shifting
term" $\frac{\beta}{q_{e,t}\varepsilon}S_t$ is a random variable.

Lemma 2 For any < <5 < 1, < e < 1, for any 



1 

n > — max 
e 



K 2 ln 2 {2\E\/5) K\n(2\E\/5)\ 



\E\]nN 



\E\ 



parameters 



2r]K\C\ 



<7> 



' elniV 
AnK 2 \C\ 



and (3 



K , 2\E\ 

In -V 1 < 1 > 



and e £ E, we have 



< 



2\E\ 



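Proof. Fix $e \in E$. Using (1) with $u = \frac{4}{\beta\varepsilon}\ln\frac{2|E|}{\delta}$ and $c = \frac{\beta\varepsilon}{4}$, it suffices to prove for all $n$ that

$$\mathbb{E}\left[e^{c(G_{e,n} - G'_{e,n})}\right] \le 1.$$

Similarly to Lemma 1, we introduce $Z_t = e^{c(G_{e,t} - G'_{e,t})}$ and we show that $Z_1,\ldots,Z_n$ is a supermartingale,
that is, $\mathbb{E}_t[Z_t] \le Z_{t-1}$ for $t \ge 2$, where $\mathbb{E}_t$ denotes $\mathbb{E}[\,\cdot \mid (\mathbf{I}_1,S_1),\ldots,(\mathbf{I}_{t-1},S_{t-1})]$. Taking
conditional expectations, we obtain

$$\mathbb{E}_t[Z_t] = Z_{t-1}\,\mathbb{E}_t\left[\exp\left(c\left(g_{e,t} - \frac{\mathbb{1}_{\{e\in\mathbf{I}_t\}}S_t g_{e,t} + S_t\beta}{q_{e,t}\varepsilon}\right)\right)\right]$$

$$\le Z_{t-1}\,\mathbb{E}_t\left[1 + c\left(g_{e,t} - \frac{\mathbb{1}_{\{e\in\mathbf{I}_t\}}S_t g_{e,t} + S_t\beta}{q_{e,t}\varepsilon}\right) + c^2\left(g_{e,t} - \frac{\mathbb{1}_{\{e\in\mathbf{I}_t\}}S_t g_{e,t} + S_t\beta}{q_{e,t}\varepsilon}\right)^2\right]. \tag{10}$$

Since

$$\mathbb{E}_t\left[g_{e,t} - \frac{\mathbb{1}_{\{e\in\mathbf{I}_t\}}S_t g_{e,t} + S_t\beta}{q_{e,t}\varepsilon}\right] = -\frac{\beta}{q_{e,t}}$$

and

$$\mathbb{E}_t\left[\left(g_{e,t} - \frac{\mathbb{1}_{\{e\in\mathbf{I}_t\}}S_t g_{e,t}}{q_{e,t}\varepsilon}\right)^2\right] \le \mathbb{E}_t\left[\left(\frac{\mathbb{1}_{\{e\in\mathbf{I}_t\}}S_t g_{e,t}}{q_{e,t}\varepsilon}\right)^2\right] \le \frac{1}{q_{e,t}\varepsilon},$$

we get from (10) that

$$\mathbb{E}_t[Z_t] \le Z_{t-1}\left(1 - \frac{c\beta}{q_{e,t}} + \frac{c^2}{q_{e,t}\varepsilon} + c^2\,\mathbb{E}_t\left[\frac{2\,\mathbb{1}_{\{e\in\mathbf{I}_t\}}S_t g_{e,t}\beta}{q_{e,t}^2\varepsilon^2} - \frac{2g_{e,t}S_t\beta}{q_{e,t}\varepsilon} + \frac{S_t\beta^2}{q_{e,t}^2\varepsilon^2}\right]\right)$$

$$\le Z_{t-1}\left(1 + \frac{c}{q_{e,t}}\left(-\beta + \frac{c}{\varepsilon} + c\beta\left(\frac{2}{\varepsilon} + \frac{\beta}{q_{e,t}\varepsilon}\right)\right)\right). \tag{11}$$

Since $c = \beta\varepsilon/4$ we have

$$-\beta + \frac{c}{\varepsilon} + c\beta\left(\frac{2}{\varepsilon} + \frac{\beta}{q_{e,t}\varepsilon}\right) = -\frac{3\beta}{4} + \frac{\beta^2}{2} + \frac{\beta^3}{4q_{e,t}} \le -\frac{\beta}{4} + \frac{\beta^3}{4q_{e,t}}$$

$$\le -\frac{\beta}{4} + \frac{\beta^3|\mathcal{C}|}{4\gamma} \tag{12}$$

$$\le -\frac{\beta}{4} + \frac{\beta^3\varepsilon}{8\eta K} \tag{13}$$

$$\le 0,$$

where the first inequality uses $\beta \le 1$, (12) follows from $q_{e,t} \ge \frac{\gamma}{|\mathcal{C}|}$, (13) holds by $\frac{\gamma}{|\mathcal{C}|} \ge \frac{2\eta K}{\varepsilon}$,
and the last inequality is ensured by $n \ge \frac{K^2\ln^2(2|E|/\delta)}{\varepsilon|E|\ln N}$, the assumption of the lemma.

Combining (11) and (13) we get that $\mathbb{E}_t[Z_t] \le Z_{t-1}$. Taking expectations on both sides of the
inequality, we get $\mathbb{E}[Z_t] \le \mathbb{E}[Z_{t-1}]$ and since $\mathbb{E}[Z_1] \le 1$, we obtain $\mathbb{E}[Z_n] \le 1$ as desired. □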



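Proof of Theorem 3. The proof of the theorem is a generalization of that of Theorem 1 and
follows the same lines with some extra technicalities to handle the effects of the modified step (c').
Therefore, in the following we emphasize only the differences. First note that (5) and (7) also hold
in this case. Bounding the sums in (7), one obtains

$$\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\, g'_{\mathbf{i},t} = \frac{S_t}{\varepsilon}\left(g_{\mathbf{I}_t,t} + |E|\beta\right)$$

and

$$\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\, (g'_{\mathbf{i},t})^2 \le \frac{1}{\varepsilon}K(1+\beta)\sum_{e\in E} g'_{e,t}.$$

Plugging these bounds into (7) and summing for $t = 1,\ldots,n$, we obtain

$$\ln\frac{\overline{W}_n}{\overline{W}_0} \le \frac{\eta}{1-\gamma}\sum_{t=1}^n \frac{S_t}{\varepsilon}\left(g_{\mathbf{I}_t,t} + |E|\beta\right) + \frac{\eta^2 K(1+\beta)}{(1-\gamma)\varepsilon}|\mathcal{C}|\max_{\mathbf{i}\in\mathcal{P}} G'_{\mathbf{i},n}.$$

Combining the upper bound with the lower bound (5), we obtain

$$\sum_{t=1}^n \frac{S_t}{\varepsilon}\left(g_{\mathbf{I}_t,t} + |E|\beta\right) \ge \left(1 - \gamma - \frac{\eta K(1+\beta)|\mathcal{C}|}{\varepsilon}\right)\max_{\mathbf{i}\in\mathcal{P}} G'_{\mathbf{i},n} - \frac{\ln N}{\eta}. \tag{14}$$

To relate the left-hand side of the above inequality to the real gain $\sum_{t=1}^n g_{\mathbf{I}_t,t}$, notice that

$$X_t = \frac{S_t}{\varepsilon}\left(g_{\mathbf{I}_t,t} + |E|\beta\right) - \left(g_{\mathbf{I}_t,t} + |E|\beta\right)$$

is a martingale difference sequence with respect to $(\mathbf{I}_1,S_1), (\mathbf{I}_2,S_2), \ldots$. Now for all $t = 1,\ldots,n$,
we have the bound

$$\mathbb{E}\left[X_t^2 \mid (\mathbf{I}_1,S_1),\ldots,(\mathbf{I}_{t-1},S_{t-1})\right] \le \mathbb{E}\left[\frac{S_t}{\varepsilon^2}\left(g_{\mathbf{I}_t,t} + |E|\beta\right)^2 \,\Big|\, (\mathbf{I}_1,S_1),\ldots,(\mathbf{I}_{t-1},S_{t-1})\right] \le \frac{(K + |E|\beta)^2}{\varepsilon} \le \frac{4K^2}{\varepsilon} \stackrel{\mathrm{def}}{=} \sigma^2 \tag{15}$$

where (15) holds by $n \ge \frac{|E|\ln(2|E|/\delta)}{K\varepsilon}$ (to ensure $\beta|E| \le K$). We know that

$$X_t \in \left[-2K,\ \left(\frac{1}{\varepsilon} - 1\right)2K\right]$$

for all $t$. Now apply Bernstein's inequality for martingale differences (see Lemma 9 in the Appendix)
to obtain

$$\mathbb{P}\left[\sum_{t=1}^n X_t > u\right] \le \frac{\delta}{2} \tag{16}$$

where

$$u = \sqrt{2n\frac{4K^2}{\varepsilon}\ln\frac{2}{\delta}} + \frac{4K}{3\varepsilon}\ln\frac{2}{\delta}.$$

From (16) we get

$$\mathbb{P}\left[\sum_{t=1}^n \frac{S_t}{\varepsilon}\left(g_{\mathbf{I}_t,t} + |E|\beta\right) \ge \widehat{G}_n + \beta n|E| + u\right] \le \frac{\delta}{2}. \tag{17}$$

Now Lemma 2, the union bound, and (17) combined with (14) yield, with probability at least
$1-\delta$,

$$\widehat{G}_n \ge \left(1 - \gamma - \frac{\eta K(1+\beta)|\mathcal{C}|}{\varepsilon}\right)\left(\max_{\mathbf{i}\in\mathcal{P}} G_{\mathbf{i},n} - \frac{4K}{\beta\varepsilon}\ln\frac{2|E|}{\delta}\right) - \frac{\ln N}{\eta} - \beta n|E| - u$$

since the coefficient of $G'_{\mathbf{i},n}$ is greater than zero by the assumptions of the theorem.
Since $\widehat{G}_n = Kn - \widehat{L}_n$ and $G_{\mathbf{i},n} = Kn - L_{\mathbf{i},n}$, we have

$$\widehat{L}_n \le \left(1 - \gamma - \frac{\eta K(1+\beta)|\mathcal{C}|}{\varepsilon}\right)\min_{\mathbf{i}\in\mathcal{P}} L_{\mathbf{i},n} + Kn\left(\gamma + \frac{\eta K(1+\beta)|\mathcal{C}|}{\varepsilon}\right)$$

$$\qquad + \left(1 - \gamma - \frac{\eta K(1+\beta)|\mathcal{C}|}{\varepsilon}\right)\frac{4K}{\beta\varepsilon}\ln\frac{2|E|}{\delta} + \beta n|E| + \frac{\ln N}{\eta} + u$$

$$\le \min_{\mathbf{i}\in\mathcal{P}} L_{\mathbf{i},n} + Kn\left(\gamma + \frac{\eta K(1+\beta)|\mathcal{C}|}{\varepsilon}\right) + 5\beta n|E| + \frac{\ln N}{\eta} + u,$$

where we used the fact that $\frac{K}{\beta\varepsilon}\ln\frac{2|E|}{\delta} = \beta n|E|$.
Substituting the value of $\beta$, $\eta$ and $\gamma$, we have

$$\widehat{L}_n - \min_{\mathbf{i}\in\mathcal{P}} L_{\mathbf{i},n} \le Kn\frac{2\eta K|\mathcal{C}|}{\varepsilon} + Kn\frac{2\eta K|\mathcal{C}|}{\varepsilon} + \frac{\ln N}{\eta} + 5\beta n|E| + u$$

$$\le 4K\sqrt{\frac{n|\mathcal{C}|\ln N}{\varepsilon}} + 5\sqrt{\frac{n|E|K\ln(2|E|/\delta)}{\varepsilon}} + u$$

$$\le \sqrt{\frac{nK}{\varepsilon}}\left(4\sqrt{K|\mathcal{C}|\ln N} + 5\sqrt{|E|\ln(2|E|/\delta)} + \sqrt{8K\ln(2/\delta)}\right) + \frac{4K}{3\varepsilon}\ln(2/\delta)$$

as desired. □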



6 A bandit algorithm for tracking the shortest path 

Our goal in this section is to extend the bandit algorithm so that it is able to compete with time-
varying paths under small computational complexity. This is a variant of the problem known
as tracking the best expert; see, for example, Herbster and Warmuth [19], Vovk [30], Auer and
Warmuth [2], Bousquet and Warmuth [6], Herbster and Warmuth [20].

To describe the loss the decision maker is compared to, consider the following "$m$-partition"
prediction scheme: the sequence of paths is partitioned into $m+1$ contiguous segments, and on each
segment the scheme assigns exactly one of the $N$ paths. Formally, an $m$-partition $\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})$ of
the $n$ paths is given by an $m$-tuple $\mathbf{t} = (t_1,\ldots,t_m)$ such that $t_0 = 1 < t_1 < \cdots < t_m < n+1 = t_{m+1}$,
and an $(m+1)$-vector $\underline{\mathbf{i}} = (\mathbf{i}_0,\ldots,\mathbf{i}_m)$ where $\mathbf{i}_j \in \mathcal{P}$. At each time instant $t$, $t_j \le t < t_{j+1}$, path $\mathbf{i}_j$
is used to predict the best path. The cumulative loss of a partition $\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})$ is

$$L(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) = \sum_{j=0}^{m}\sum_{t=t_j}^{t_{j+1}-1} \ell_{\mathbf{i}_j,t} = \sum_{j=0}^{m}\sum_{t=t_j}^{t_{j+1}-1}\sum_{e\in\mathbf{i}_j} \ell_{e,t}.$$

The goal of the decision maker is to perform as well as the best time-varying path (partition),
that is, to keep the normalized regret

$$\frac{1}{n}\left(\widehat{L}_n - \min_{\mathbf{t},\underline{\mathbf{i}}} L(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}}))\right)$$

as small as possible (with high probability) for all possible outcome sequences.

In the "classical" tracking problem there is a relatively small number of "base" experts and the
goal of the decision maker is to predict as well as the best "compound" expert (i.e., time-varying
expert). However, in our case, base experts correspond to all paths of the graph between source and
destination whose number is typically exponentially large in the number of edges, and therefore we
cannot directly apply the computationally efficient methods for tracking the best expert. György,
Linder, and Lugosi [15] develop efficient algorithms for tracking the best expert for certain large
and structured classes of base experts, including the shortest path problem. The purpose of the
following algorithm is to extend the methods of [15] to the bandit setting when the forecaster only
observes the losses of the edges on the chosen path.
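
The cumulative loss of an $m$-partition defined above can be evaluated directly from the edge losses. The following is a minimal sketch under assumed data structures (a list of per-round dictionaries of edge losses, and paths given as lists of edge names); these representations and the function name are illustrative and not part of the paper.

```python
def partition_loss(edge_losses, switch_times, paths):
    """Cumulative loss of the partition Part(n, m, t, i).

    edge_losses:  list of dicts; edge_losses[t - 1][e] is the loss of edge e in round t
    switch_times: (t_1, ..., t_m) with 1 < t_1 < ... < t_m < n + 1
    paths:        (i_0, ..., i_m); path i_j (a list of edges) is used for t_j <= t < t_{j+1}
    """
    n = len(edge_losses)
    bounds = [1, *switch_times, n + 1]  # t_0 = 1 and t_{m+1} = n + 1
    if len(paths) != len(bounds) - 1:
        raise ValueError("need exactly m + 1 paths for m switch times")
    if any(lo >= hi for lo, hi in zip(bounds, bounds[1:])):
        raise ValueError("switch times must be strictly increasing within (1, n + 1)")
    total = 0.0
    for j, path in enumerate(paths):
        for t in range(bounds[j], bounds[j + 1]):
            total += sum(edge_losses[t - 1][e] for e in path)
    return total
```

With no switch times the function reduces to the cumulative loss of a single fixed path, which is the comparison class of the earlier sections.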
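A BANDIT ALGORITHM FOR TRACKING SHORTEST PATHS

Parameters: real numbers $\beta > 0$, $0 < \eta, \gamma < 1$, $0 \le \alpha \le 1$.

Initialization: Set $w_{e,0} = 1$ for each $e \in E$, $\overline{w}_{\mathbf{i},0} = 1$ for each $\mathbf{i} \in \mathcal{P}$, and $\overline{W}_0 = N$.
For each round $t = 1, 2, \ldots$

(a) Choose a path $\mathbf{I}_t$ according to the distribution $\mathbf{p}_t$ defined by

$$p_{\mathbf{i},t} = \begin{cases} (1-\gamma)\dfrac{\overline{w}_{\mathbf{i},t-1}}{\overline{W}_{t-1}} + \dfrac{\gamma}{|\mathcal{C}|} & \text{if } \mathbf{i} \in \mathcal{C}; \\[4pt] (1-\gamma)\dfrac{\overline{w}_{\mathbf{i},t-1}}{\overline{W}_{t-1}} & \text{if } \mathbf{i} \notin \mathcal{C}. \end{cases}$$

(b) Compute the probability of choosing each edge $e$ as

$$q_{e,t} = \sum_{\mathbf{i}: e\in\mathbf{i}} p_{\mathbf{i},t} = (1-\gamma)\frac{\sum_{\mathbf{i}: e\in\mathbf{i}} \overline{w}_{\mathbf{i},t-1}}{\overline{W}_{t-1}} + \gamma\frac{|\{\mathbf{i}\in\mathcal{C} : e\in\mathbf{i}\}|}{|\mathcal{C}|}.$$

(c) Calculate the estimated gains

$$g'_{e,t} = \begin{cases} \dfrac{g_{e,t}+\beta}{q_{e,t}} & \text{if } e \in \mathbf{I}_t; \\[4pt] \dfrac{\beta}{q_{e,t}} & \text{otherwise.} \end{cases}$$

(d) Compute the updated weights

$$\overline{v}_{\mathbf{i},t} = \overline{w}_{\mathbf{i},t-1}\, e^{\eta g'_{\mathbf{i},t}}$$

$$\overline{w}_{\mathbf{i},t} = (1-\alpha)\overline{v}_{\mathbf{i},t} + \frac{\alpha}{N}\overline{W}_t$$

where $g'_{\mathbf{i},t} = \sum_{e\in\mathbf{i}} g'_{e,t}$ and $\overline{W}_t$ is the sum of the total weights of the paths, that
is,

$$\overline{W}_t = \sum_{\mathbf{i}\in\mathcal{P}} \overline{v}_{\mathbf{i},t} = \sum_{\mathbf{i}\in\mathcal{P}} \overline{w}_{\mathbf{i},t}.$$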
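
Step (d) of the tracking algorithm can be illustrated on an explicit (small) list of paths. The sketch below keeps one weight per path, which is only feasible for toy graphs and is not the efficient implementation discussed later in this section; the function name and the list-based layout are assumptions made for illustration.

```python
import math


def tracking_update(w_prev, path_gain_estimates, eta, alpha):
    """One application of step (d) with explicit per-path weights.

    w_prev:              list of weights w_{i,t-1}, one per path
    path_gain_estimates: list of estimated path gains g'_{i,t}
    Returns (v, w, total) where v_i = w_prev_i * exp(eta * g'_i),
    total = sum_i v_i, and w_i = (1 - alpha) * v_i + (alpha / N) * total.
    """
    n_paths = len(w_prev)
    v = [w * math.exp(eta * g) for w, g in zip(w_prev, path_gain_estimates)]
    total = sum(v)
    w = [(1.0 - alpha) * v_i + alpha * total / n_paths for v_i in v]
    return v, w, total
```

The mixing step leaves the total weight unchanged, and it guarantees that every path keeps at least an $\alpha/N$ share of the total weight, which is the property the analysis relies on when the best path changes.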



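The following performance bound shows that the normalized regret with respect to the best
time-varying path which is allowed to switch paths $m$ times is roughly of the order of $K\sqrt{(m/n)|\mathcal{C}|\ln N}$.

Theorem 4 For any $\delta \in (0,1)$ and parameters $0 \le \gamma < 1/2$, $\alpha, \beta \in [0,1]$, and $\eta > 0$ satisfying
$2\eta K|\mathcal{C}| \le \gamma$, the performance of the algorithm defined above can be bounded, with probability at
least $1-\delta$, as

$$\widehat{L}_n - \min_{\mathbf{t},\underline{\mathbf{i}}} L(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) \le Kn\left(\gamma + \eta(1+\beta)K|\mathcal{C}|\right) + \frac{K(m+1)}{\beta}\ln\frac{|E|(m+1)}{\delta}$$

$$\qquad + \beta n|E| + \frac{1}{\eta}\ln\left(\frac{N^{m+1}}{\alpha^m(1-\alpha)^{n-m-1}}\right).$$

In particular, choosing

$$\beta = \sqrt{\frac{K(m+1)}{n|E|}\ln\frac{|E|(m+1)}{\delta}}, \qquad \gamma = 2\eta K|\mathcal{C}|, \qquad \alpha = \frac{m}{n-1},$$

and

$$\eta = \sqrt{\frac{(m+1)\ln N + m\ln\frac{e(n-1)}{m}}{4nK^2|\mathcal{C}|}}$$

we have, for all $n \ge \max\left\{\frac{K(m+1)}{|E|}\ln\frac{|E|(m+1)}{\delta},\ 4|\mathcal{C}|D\right\}$,

$$\widehat{L}_n - \min_{\mathbf{t},\underline{\mathbf{i}}} L(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) \le 2\sqrt{nK}\left(\sqrt{4K|\mathcal{C}|D} + \sqrt{|E|(m+1)\ln\frac{|E|(m+1)}{\delta}}\right)$$

where

$$D = (m+1)\ln N + m\left(1 + \ln\frac{n-1}{m}\right).$$

The proof of the theorem is a combination of that of our Theorem 1 and Theorem 1 of [15]. We
will need the following three lemmas.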

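Lemma 3 For any $1 \le t \le t' \le n$ and any $\mathbf{i} \in \mathcal{P}$,

$$\frac{\overline{v}_{\mathbf{i},t'}}{\overline{w}_{\mathbf{i},t-1}} \ge e^{\eta G'_{\mathbf{i},[t,t']}}(1-\alpha)^{t'-t}$$

where $G'_{\mathbf{i},[t,t']} = \sum_{\tau=t}^{t'} g'_{\mathbf{i},\tau}$.

Proof. The proof is a straightforward modification of the one in Herbster and Warmuth [19]. From
the definitions of $\overline{v}_{\mathbf{i},t}$ and $\overline{w}_{\mathbf{i},t}$ (see step (d) of the algorithm) it is clear that for any $\tau \ge 1$,

$$\overline{w}_{\mathbf{i},\tau} = (1-\alpha)\overline{v}_{\mathbf{i},\tau} + \frac{\alpha}{N}\overline{W}_\tau \ge (1-\alpha)e^{\eta g'_{\mathbf{i},\tau}}\,\overline{w}_{\mathbf{i},\tau-1}.$$

Applying this inequality iteratively for $\tau = t, t+1, \ldots, t'-1$, and the definition of $\overline{v}_{\mathbf{i},t'}$ (step (d)) for
$\tau = t'$, we obtain

$$\overline{v}_{\mathbf{i},t'} = \overline{w}_{\mathbf{i},t'-1}\, e^{\eta g'_{\mathbf{i},t'}} \ge e^{\eta g'_{\mathbf{i},t'}}\prod_{\tau=t}^{t'-1}\left((1-\alpha)e^{\eta g'_{\mathbf{i},\tau}}\right)\overline{w}_{\mathbf{i},t-1} = e^{\eta G'_{\mathbf{i},[t,t']}}(1-\alpha)^{t'-t}\,\overline{w}_{\mathbf{i},t-1}$$

which implies the statement of the lemma. □

Lemma 4 For any $t \ge 1$ and $\mathbf{i}, \mathbf{j} \in \mathcal{P}$, we have

$$\frac{\overline{w}_{\mathbf{i},t}}{\overline{v}_{\mathbf{j},t}} \ge \frac{\alpha}{N}.$$

Proof. By the definition of $\overline{w}_{\mathbf{i},t}$ we have

$$\overline{w}_{\mathbf{i},t} = (1-\alpha)\overline{v}_{\mathbf{i},t} + \frac{\alpha}{N}\overline{W}_t \ge \frac{\alpha}{N}\overline{W}_t \ge \frac{\alpha}{N}\overline{v}_{\mathbf{j},t}.$$

This completes the proof of the lemma. □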

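The next lemma is a simple corollary of Lemma 1.

Lemma 5 For any $\delta \in (0,1)$, $0 \le \beta \le 1$, $t \ge 1$ and $e \in E$ we have

$$\mathbb{P}\left[G_{e,t} > G'_{e,t} + \frac{1}{\beta}\ln\frac{|E|(m+1)}{\delta}\right] \le \frac{\delta}{|E|(m+1)}.$$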



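Proof of Theorem 4. The theorem is proved the same way as Theorem 1 until (8), that is,

$$\ln\frac{\overline{W}_n}{\overline{W}_0} \le \frac{\eta}{1-\gamma}\left(\widehat{G}_n + n\beta|E|\right) + \frac{\eta^2 K(1+\beta)}{1-\gamma}|\mathcal{C}|\max_{\mathbf{i}\in\mathcal{P}} G'_{\mathbf{i},n}. \tag{18}$$

Let $\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})$ be an arbitrary partition. Then the lower bound is obtained as

$$\ln\frac{\overline{W}_n}{\overline{W}_0} = \ln\sum_{\mathbf{j}\in\mathcal{P}}\frac{\overline{w}_{\mathbf{j},n}}{N} = \ln\sum_{\mathbf{j}\in\mathcal{P}}\frac{\overline{v}_{\mathbf{j},n}}{N} \ge \ln\frac{\overline{v}_{\mathbf{i}_m,n}}{N} \tag{19}$$

(recall that $\mathbf{i}_m$ denotes the path used in the last segment of the partition). Now $\overline{v}_{\mathbf{i}_m,n}$ can be
rewritten in the form of the following telescoping product

$$\overline{v}_{\mathbf{i}_m,n} = \overline{w}_{\mathbf{i}_0,t_0-1}\,\frac{\overline{v}_{\mathbf{i}_0,t_1-1}}{\overline{w}_{\mathbf{i}_0,t_0-1}}\prod_{j=1}^{m}\left(\frac{\overline{w}_{\mathbf{i}_j,t_j-1}}{\overline{v}_{\mathbf{i}_{j-1},t_j-1}}\cdot\frac{\overline{v}_{\mathbf{i}_j,t_{j+1}-1}}{\overline{w}_{\mathbf{i}_j,t_j-1}}\right).$$

Therefore, applying Lemmas 3 and 4 we have

$$\overline{v}_{\mathbf{i}_m,n} \ge \overline{w}_{\mathbf{i}_0,t_0-1}\left(\frac{\alpha}{N}\right)^m\prod_{j=0}^{m}\left((1-\alpha)^{t_{j+1}-1-t_j}\, e^{\eta G'_{\mathbf{i}_j,[t_j,t_{j+1}-1]}}\right) = \left(\frac{\alpha}{N}\right)^m e^{\eta G'(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}}))}(1-\alpha)^{n-m-1}.$$

Combining the lower bound with the upper bound (18), we have

$$\ln\left(\frac{\alpha^m(1-\alpha)^{n-m-1}}{N^{m+1}}\right) + \max_{\mathbf{t},\underline{\mathbf{i}}}\eta\, G'(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) \le \frac{\eta}{1-\gamma}\left(\widehat{G}_n + n\beta|E|\right) + \frac{\eta^2 K(1+\beta)}{1-\gamma}|\mathcal{C}|\max_{\mathbf{i}\in\mathcal{P}} G'_{\mathbf{i},n},$$

where we used the fact that $\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})$ is an arbitrary partition. After rearranging and using
$\max_{\mathbf{i}\in\mathcal{P}} G'_{\mathbf{i},n} \le \max_{\mathbf{t},\underline{\mathbf{i}}} G'(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}}))$ we get

$$\widehat{G}_n \ge \left(1 - \gamma - \eta K(1+\beta)|\mathcal{C}|\right)\max_{\mathbf{t},\underline{\mathbf{i}}} G'(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) - n|E|\beta - \frac{1-\gamma}{\eta}\ln\left(\frac{N^{m+1}}{\alpha^m(1-\alpha)^{n-m-1}}\right).$$

Now since $1 - \gamma - \eta K(1+\beta)|\mathcal{C}| \ge 0$, by the assumptions of the theorem and from Lemma 5 with
an application of the union bound we obtain that, with probability at least $1-\delta$,

$$\widehat{G}_n \ge \left(1 - \gamma - \eta K(1+\beta)|\mathcal{C}|\right)\left(\max_{\mathbf{t},\underline{\mathbf{i}}} G(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) - \frac{K(m+1)}{\beta}\ln\frac{|E|(m+1)}{\delta}\right)$$

$$\qquad - n|E|\beta - \frac{1-\gamma}{\eta}\ln\left(\frac{N^{m+1}}{\alpha^m(1-\alpha)^{n-m-1}}\right).$$

Since $\widehat{G}_n = Kn - \widehat{L}_n$ and $G(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) = Kn - L(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}}))$, we have

$$\widehat{L}_n \le \left(1 - \gamma - \eta K(1+\beta)|\mathcal{C}|\right)\min_{\mathbf{t},\underline{\mathbf{i}}} L(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) + Kn\left(\gamma + \eta(1+\beta)K|\mathcal{C}|\right)$$

$$\qquad + \left(1 - \gamma - \eta(1+\beta)K|\mathcal{C}|\right)\frac{K(m+1)}{\beta}\ln\frac{|E|(m+1)}{\delta} + n\beta|E| + \frac{1}{\eta}\ln\left(\frac{N^{m+1}}{\alpha^m(1-\alpha)^{n-m-1}}\right).$$

This implies that, with probability at least $1-\delta$,

$$\widehat{L}_n - \min_{\mathbf{t},\underline{\mathbf{i}}} L(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) \le Kn\left(\gamma + \eta(1+\beta)K|\mathcal{C}|\right) + \frac{K(m+1)}{\beta}\ln\frac{|E|(m+1)}{\delta}$$

$$\qquad + n\beta|E| + \frac{1}{\eta}\ln\left(\frac{N^{m+1}}{\alpha^m(1-\alpha)^{n-m-1}}\right). \tag{20}$$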

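To prove the second statement, let $H_b(p) = -p\ln p - (1-p)\ln(1-p)$ and $D_b(p\,\|\,q) = p\ln\frac{p}{q} +
(1-p)\ln\frac{1-p}{1-q}$. Optimizing the value of $\alpha$ in the last term of (20) gives

$$\frac{1}{\eta}\ln\left(\frac{N^{m+1}}{\alpha^m(1-\alpha)^{n-m-1}}\right) = \frac{1}{\eta}\left((m+1)\ln N + m\ln\frac{1}{\alpha} + (n-m-1)\ln\frac{1}{1-\alpha}\right)$$

$$= \frac{1}{\eta}\left((m+1)\ln N + (n-1)\left(D_b(\alpha^*\,\|\,\alpha) + H_b(\alpha^*)\right)\right)$$

where $\alpha^* = \frac{m}{n-1}$. For $\alpha = \alpha^*$ we obtain

$$\frac{1}{\eta}\ln\left(\frac{N^{m+1}}{\alpha^m(1-\alpha)^{n-m-1}}\right) = \frac{1}{\eta}\left((m+1)\ln N + (n-1)H_b(\alpha^*)\right)$$

$$= \frac{1}{\eta}\left((m+1)\ln N + m\ln\frac{n-1}{m} + (n-m-1)\ln\left(1 + \frac{m}{n-m-1}\right)\right)$$

$$\le \frac{1}{\eta}\left((m+1)\ln N + m\ln\frac{n-1}{m} + m\right)$$

$$= \frac{1}{\eta}\left((m+1)\ln N + m\ln\frac{e(n-1)}{m}\right) \stackrel{\mathrm{def}}{=} \frac{1}{\eta}D$$

where the inequality follows since $\ln(1+x) \le x$ for $x > 0$. Therefore

$$\widehat{L}_n - \min_{\mathbf{t},\underline{\mathbf{i}}} L(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) \le Kn\left(\gamma + \eta(1+\beta)K|\mathcal{C}|\right) + \frac{K(m+1)}{\beta}\ln\frac{|E|(m+1)}{\delta} + n\beta|E| + \frac{1}{\eta}D,$$

which is the first statement of the theorem. Setting

$$\beta = \sqrt{\frac{K(m+1)}{n|E|}\ln\frac{|E|(m+1)}{\delta}}, \qquad \gamma = 2\eta K|\mathcal{C}|, \qquad \text{and} \qquad \eta = \sqrt{\frac{D}{4nK^2|\mathcal{C}|}}$$

results in the second statement of the theorem, that is,

$$\widehat{L}_n - \min_{\mathbf{t},\underline{\mathbf{i}}} L(\mathrm{Part}(n,m,\mathbf{t},\underline{\mathbf{i}})) \le 2\sqrt{nK}\left(\sqrt{4K|\mathcal{C}|D} + \sqrt{|E|(m+1)\ln\frac{|E|(m+1)}{\delta}}\right).$$

□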
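
The role of the choice $\alpha^* = m/(n-1)$ in the proof above can be checked numerically. The following sketch evaluates the switching term $\ln\bigl(N^{m+1}/(\alpha^m(1-\alpha)^{n-m-1})\bigr)$ in log space and the quantity $D$; the function names are illustrative assumptions, and the sketch assumes $0 < m < n-1$ so that all logarithms are finite.

```python
import math


def switching_penalty(n, m, n_paths, alpha):
    """ln( N^(m+1) / (alpha^m * (1 - alpha)^(n - m - 1)) ), computed in log space."""
    return ((m + 1) * math.log(n_paths)
            - m * math.log(alpha)
            - (n - m - 1) * math.log(1.0 - alpha))


def d_bound(n, m, n_paths):
    """D = (m + 1) ln N + m (1 + ln((n - 1) / m))."""
    return (m + 1) * math.log(n_paths) + m * (1.0 + math.log((n - 1) / m))
```

Evaluating `switching_penalty` at `alpha = m / (n - 1)` should give a value no larger than `d_bound`, and no larger than the value at any other `alpha` in $(0,1)$.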



Similarly to [15], the proposed algorithm has an alternative version, which is efficiently
computable:

AN ALTERNATIVE BANDIT ALGORITHM FOR TRACKING SHORTEST PATHS

For $t = 1$, choose $\mathbf{I}_1$ uniformly from the set $\mathcal{P}$. For $t \ge 2$,

(a) Draw a Bernoulli random variable $\Gamma_t$ with $\mathbb{P}(\Gamma_t = 1) = \gamma$.

(b) If $\Gamma_t = 1$, then choose $\mathbf{I}_t$ uniformly from $\mathcal{C}$.

(c) If $\Gamma_t = 0$,

(c1) choose $\tau_t$ randomly according to the distribution

$$\mathbb{P}\{\tau_t = t'\} = \begin{cases} \dfrac{(1-\alpha)^{t-1} Z_{1,t-1}}{\overline{W}_{t-1}} & \text{for } t' = 1, \\[6pt] \dfrac{\alpha(1-\alpha)^{t-t'}\,\overline{W}_{t'-1} Z_{t',t-1}}{N\overline{W}_{t-1}} & \text{for } t' = 2,\ldots,t, \end{cases}$$

where $Z_{t',t-1} = \sum_{\mathbf{i}\in\mathcal{P}} e^{\eta G'_{\mathbf{i},[t',t-1]}}$ for $t' = 1,\ldots,t-1$, and $Z_{t,t-1} = N$;

(c2) given $\tau_t = t'$, choose $\mathbf{I}_t$ randomly according to the probabilities

$$\mathbb{P}\{\mathbf{I}_t = \mathbf{i} \mid \tau_t = t'\} = \begin{cases} \dfrac{e^{\eta G'_{\mathbf{i},[t',t-1]}}}{Z_{t',t-1}} & \text{for } t' = 1,\ldots,t-1, \\[6pt] \dfrac{1}{N} & \text{for } t' = t. \end{cases}$$

In a way completely analogous to [15], in this alternative formulation of the algorithm one can
compute the probabilities $\mathbb{P}\{\mathbf{I}_t = \mathbf{i} \mid \tau_t = t'\}$ and the normalization factors $Z_{t',t-1}$ efficiently. Using
the fact that the baseline bandit algorithm for shortest paths has an $O(n|E|)$ time complexity by
Theorem 2, it follows from Theorem 3 of [15] that the time complexity of the alternative bandit
algorithm for tracking the shortest path is $O(n^2|E|)$.

7 An algorithm for the restricted multi-armed bandit problem 

In this section we consider the situation where the decision maker receives information only about
the performance of the whole chosen path, but the individual edge losses are not available. That
is, the forecaster has access to the sum $\ell_{\mathbf{I}_t,t}$ of losses over the chosen path $\mathbf{I}_t$ but not to the losses
$\{\ell_{e,t}\}_{e\in\mathbf{I}_t}$ of the edges belonging to $\mathbf{I}_t$.

This is the problem formulation considered by McMahan and Blum [25] and Awerbuch and
Kleinberg [4]. McMahan and Blum provided a relatively simple algorithm whose regret is at
most of the order of $n^{-1/4}$, while Awerbuch and Kleinberg gave a more complex algorithm to
achieve $O(n^{-1/3})$ regret. In this section we combine the strengths of these papers, and propose
a simple algorithm with regret at most of the order of $n^{-1/3}$. Moreover, our bound holds with
high probability, while the above-mentioned papers prove bounds for the expected regret only. The
proposed algorithm uses ideas very similar to those of McMahan and Blum [25]. The algorithm
alternates between choosing a path from a "basis" $B$ to obtain unbiased estimates of the loss
(exploration step), and choosing a path according to exponential weighting based on these estimates.

A simple way to describe a path $\mathbf{i} \in \mathcal{P}$ is a binary row vector with $|E|$ components which are
indexed by the edges of the graph such that, for each $e \in E$, the $e$th entry of the vector is 1 if $e \in \mathbf{i}$
and 0 otherwise. With a slight abuse of notation we will also denote by $\mathbf{i}$ the binary row vector
representing path $\mathbf{i}$. In the previous sections, where the loss of each edge along the chosen path
is available to the decision maker, the complexity stemming from the large number of paths was
reduced by representing all information in terms of the edges, as the set of edges spans the set of
paths. That is, the vector corresponding to a given path can be expressed as the linear combination
of the unit vectors associated with the edges (the $e$th component of the unit vector representing
edge $e$ is 1, while the other components are 0). While the losses corresponding to such a spanning
set are not observable in the restricted setting of this section, one can choose a subset of $\mathcal{P}$ that
forms a basis, that is, a collection of $b$ paths which are linearly independent and each path in $\mathcal{P}$
can be expressed as a linear combination of the paths in the basis. We denote by $B$ the $b \times |E|$
matrix whose rows $\mathbf{b}^1,\ldots,\mathbf{b}^b$ represent the paths in the basis. Note that $b$ is equal to the maximum
number of linearly independent vectors in $\{\mathbf{i} : \mathbf{i} \in \mathcal{P}\}$, so $b \le |E|$.

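Let $\boldsymbol{\ell}_t^{(E)}$ denote the (column) vector of the edge losses $\{\ell_{e,t}\}_{e\in E}$ at time $t$, and let $\boldsymbol{\ell}_t^{(B)} =
(\ell_{\mathbf{b}^1,t},\ldots,\ell_{\mathbf{b}^b,t})^T$ be a $b$-dimensional column vector whose components are the losses of the paths
in the basis $B$ at time $t$. If $\alpha^{(\mathbf{i},B)}_{\mathbf{b}^1},\ldots,\alpha^{(\mathbf{i},B)}_{\mathbf{b}^b}$ are the coefficients in the linear combination of the
basis paths expressing path $\mathbf{i} \in \mathcal{P}$, that is, $\mathbf{i} = \sum_{j=1}^{b}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\mathbf{b}^j$, then the loss of path $\mathbf{i} \in \mathcal{P}$ at time $t$
is given by

$$\ell_{\mathbf{i},t} = \langle\mathbf{i}, \boldsymbol{\ell}_t^{(E)}\rangle = \sum_{j=1}^{b}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\langle\mathbf{b}^j, \boldsymbol{\ell}_t^{(E)}\rangle = \sum_{j=1}^{b}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\ell_{\mathbf{b}^j,t} \tag{21}$$

where $\langle\cdot,\cdot\rangle$ denotes the standard inner product in $\mathbb{R}^{|E|}$. In the algorithm we obtain estimates $\tilde{\ell}_{\mathbf{b}^j,t}$
of the losses of the basis paths and use (21) to estimate the loss of any $\mathbf{i} \in \mathcal{P}$ as

$$\tilde{\ell}_{\mathbf{i},t} = \sum_{j=1}^{b}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\tilde{\ell}_{\mathbf{b}^j,t}. \tag{22}$$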



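It is algorithmically advantageous to calculate the estimated path losses $\tilde{\ell}_{\mathbf{i},t}$ from an intermediate
estimate of the individual edge losses. Let $B^+$ denote the Moore-Penrose inverse of $B$ defined by
$B^+ = B^T(BB^T)^{-1}$, where $B^T$ denotes the transpose of $B$ and $BB^T$ is invertible since the rows of
$B$ are linearly independent. (Note that $B^+ = B^{-1}$ if $b = |E|$.) Then letting $\tilde{\boldsymbol{\ell}}_t^{(B)} = (\tilde{\ell}_{\mathbf{b}^1,t},\ldots,\tilde{\ell}_{\mathbf{b}^b,t})^T$
and

$$\tilde{\boldsymbol{\ell}}_t^{(E)} = B^+\tilde{\boldsymbol{\ell}}_t^{(B)}$$

it is easy to see that $\tilde{\ell}_{\mathbf{i},t}$ in (22) can be obtained as $\tilde{\ell}_{\mathbf{i},t} = \langle\mathbf{i}, \tilde{\boldsymbol{\ell}}_t^{(E)}\rangle$, or equivalently

$$\tilde{\ell}_{\mathbf{i},t} = \sum_{e\in\mathbf{i}}\tilde{\ell}_{e,t}.$$

This form of the path losses allows for an efficient implementation of exponential weighting via
dynamic programming [28].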
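
The passage from basis-path estimates to edge estimates can be illustrated on a toy example. The sketch below computes $B^+ y = B^T(BB^T)^{-1}y$ with exact rational arithmetic, using a plain Gauss-Jordan solver; the helper names, the list-of-lists matrix layout, and the example graph are assumptions made purely for illustration.

```python
from fractions import Fraction


def solve(a, y):
    """Solve the square system a z = y by Gauss-Jordan elimination over the rationals.

    Assumes a is invertible (next() raises StopIteration on a singular matrix).
    """
    k = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(y[r])] for r, row in enumerate(a)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(k):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]


def edge_estimates(basis, basis_losses):
    """Return B^+ y = B^T (B B^T)^{-1} y for a 0/1 matrix `basis` with independent rows."""
    gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in basis] for r1 in basis]
    z = solve(gram, basis_losses)
    n_edges = len(basis[0])
    return [sum(z[j] * basis[j][e] for j in range(len(basis))) for e in range(n_edges)]


def path_estimate(path, edge_est):
    """Inner product <i, edge estimate> for a 0/1 path vector."""
    return sum(p * x for p, x in zip(path, edge_est))
```

As a usage example, take a two-stage graph with edge order (upper 1, lower 1, upper 2, lower 2) and the basis rows `[1,0,1,0]`, `[1,0,0,1]`, `[0,1,1,0]`. The remaining path `[0,1,0,1]` equals the second row plus the third minus the first, so its estimate obtained through the edge estimates agrees with the same combination of the basis estimates.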

To analyze the algorithm we need an upper bound on the magnitude of the coefficients $\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}$.
For this, we invoke the definition of a barycentric spanner from [4]: the basis $B$ is called a $C$-barycentric
spanner if $|\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}| \le C$ for all $\mathbf{i} \in \mathcal{P}$ and $j = 1,\ldots,b$. Awerbuch and Kleinberg [3] show
that a 1-barycentric spanner exists if $B$ is a square matrix (i.e., $b = |E|$) and give a low-complexity
algorithm which finds a $C$-barycentric spanner for $C > 1$. We use their technique to show that a
1-barycentric spanner also exists in case of a non-square $B$, when the basis is chosen to maximize
the absolute value of the determinant of $BB^T$. As before, $b$ denotes the maximum number of
linearly independent vectors (paths) in $\mathcal{P}$.

Lemma 6 For a directed acyclic graph, the set of paths $\mathcal{P}$ between two dedicated nodes has a 1-barycentric
spanner. Moreover, let $B$ be a $b \times |E|$ matrix with rows from $\mathcal{P}$ such that $\det[BB^T] \ne 0$.
If $B_{-j,\mathbf{i}}$ is the matrix obtained from $B$ by replacing its $j$th row by $\mathbf{i} \in \mathcal{P}$ and

$$\left|\det\left[B_{-j,\mathbf{i}}B_{-j,\mathbf{i}}^T\right]\right| \le C^2\left|\det\left[BB^T\right]\right| \tag{23}$$

for all $j = 1,\ldots,b$ and $\mathbf{i} \in \mathcal{P}$, then $B$ is a $C$-barycentric spanner.



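Proof. Let $B$ be a basis of $\mathcal{P}$ with rows $\mathbf{b}^1,\ldots,\mathbf{b}^b \in \mathcal{P}$ that maximizes $|\det[BB^T]|$. Then, for
any path $\mathbf{i} \in \mathcal{P}$, we have $\mathbf{i} = \sum_{j=1}^{b}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\mathbf{b}^j$ for some coefficients $\{\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\}$. Now for the matrix
$B_{-1,\mathbf{i}} = [\mathbf{i}^T, (\mathbf{b}^2)^T, \ldots, (\mathbf{b}^b)^T]^T$ we have

$$\left|\det\left[B_{-1,\mathbf{i}}B_{-1,\mathbf{i}}^T\right]\right| = \left|\det\left[B_{-1,\mathbf{i}}\mathbf{i}^T,\ B_{-1,\mathbf{i}}(\mathbf{b}^2)^T,\ B_{-1,\mathbf{i}}(\mathbf{b}^3)^T,\ \ldots,\ B_{-1,\mathbf{i}}(\mathbf{b}^b)^T\right]\right|$$

$$= \left|\det\left[\sum_{j=1}^{b}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}B_{-1,\mathbf{i}}(\mathbf{b}^j)^T,\ B_{-1,\mathbf{i}}(\mathbf{b}^2)^T,\ B_{-1,\mathbf{i}}(\mathbf{b}^3)^T,\ \ldots,\ B_{-1,\mathbf{i}}(\mathbf{b}^b)^T\right]\right|$$

$$= \left|\sum_{j=1}^{b}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\det\left[B_{-1,\mathbf{i}}(\mathbf{b}^j)^T,\ B_{-1,\mathbf{i}}(\mathbf{b}^2)^T,\ B_{-1,\mathbf{i}}(\mathbf{b}^3)^T,\ \ldots,\ B_{-1,\mathbf{i}}(\mathbf{b}^b)^T\right]\right|$$

$$= \left|\alpha^{(\mathbf{i},B)}_{\mathbf{b}^1}\right|\left|\det\left[B_{-1,\mathbf{i}}B^T\right]\right|$$

$$= \left(\alpha^{(\mathbf{i},B)}_{\mathbf{b}^1}\right)^2\left|\det\left[BB^T\right]\right|$$

where the last equality follows by the same argument by which the penultimate equality was obtained. Repeating
the same argument for $B_{-j,\mathbf{i}}$, $j = 2,\ldots,b$, we obtain

$$\left|\det\left[B_{-j,\mathbf{i}}B_{-j,\mathbf{i}}^T\right]\right| = \left(\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\right)^2\left|\det\left[BB^T\right]\right|. \tag{24}$$

Thus the maximal property of $|\det[BB^T]|$ implies $|\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}| \le 1$ for all $j = 1,\ldots,b$. The second
statement follows trivially from (23) and (24). □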



Awerbuch and Kleinberg [4] also present an iterative algorithm to find a $C$-barycentric spanner
if $B$ is a square matrix. Starting from the identity matrix, their algorithm replaces a row of the
matrix in each step by maximizing the determinant with respect to the given row. This is done by
calling an oracle function, and it is shown that the oracle is called $O(b\log_C b)$ times. In case $B$ is
not a square matrix, the algorithm carries over if we have access to an alternative oracle that can
maximize $|\det[BB^T]|$: starting from an arbitrary basis $B$ we can iteratively replace one row in
each step, using the oracle, to maximize the determinant $|\det[BB^T]|$ until (23) is satisfied for all $j$
and $\mathbf{i}$. By Lemma 6, this results in a $C$-barycentric spanner. Similarly to [4], it can be shown that
the oracle is called $O(b\log_C b)$ times for $C > 1$.

For simplicity (to avoid carrying the constant $C$), assume that we have a 2-barycentric spanner
$B$. Based on the ideas of label efficient prediction, the next algorithm gives a simple solution to
the restricted shortest path problem. The algorithm is very similar to the one in the
label efficient case, but here we cannot estimate the edge losses directly. Therefore, we query the
loss of a (random) basis vector from time to time, and create unbiased estimates $\tilde{\ell}_{\mathbf{b}^j,t}$ of the losses
of basis paths $\ell_{\mathbf{b}^j,t}$, which are then transformed into edge-loss estimates.

A BANDIT ALGORITHM FOR THE RESTRICTED SHORTEST PATH PROBLEM

Parameters: $0 < \varepsilon, \eta \le 1$.

Initialization: Set $w_{e,0} = 1$ for each $e \in E$, $\overline{w}_{\mathbf{i},0} = 1$ for each $\mathbf{i} \in \mathcal{P}$, $\overline{W}_0 = N$. Fix a
basis $B$, which is a 2-barycentric spanner. For each round $t = 1, 2, \ldots$

(a) Draw a Bernoulli random variable $S_t$ such that $\mathbb{P}(S_t = 1) = \varepsilon$;

(b) If $S_t = 1$, then choose the path $\mathbf{I}_t$ uniformly from the basis $B$. If $S_t = 0$, then
choose $\mathbf{I}_t$ according to the distribution $\{p_{\mathbf{i},t}\}$, defined by

$$p_{\mathbf{i},t} = \frac{\overline{w}_{\mathbf{i},t-1}}{\overline{W}_{t-1}}.$$

(c) Calculate the estimated loss of all edges according to

$$\tilde{\boldsymbol{\ell}}_t^{(E)} = B^+\tilde{\boldsymbol{\ell}}_t^{(B)},$$

where $\tilde{\boldsymbol{\ell}}_t^{(E)} = \{\tilde{\ell}_{e,t}\}_{e\in E}$, and $\tilde{\boldsymbol{\ell}}_t^{(B)} = (\tilde{\ell}_{\mathbf{b}^1,t},\ldots,\tilde{\ell}_{\mathbf{b}^b,t})$ is the vector of the estimated
losses

$$\tilde{\ell}_{\mathbf{b}^j,t} = \frac{S_t}{\varepsilon}\,\ell_{\mathbf{b}^j,t}\,\mathbb{1}_{\{\mathbf{I}_t = \mathbf{b}^j\}}\, b$$

for $j = 1,\ldots,b$.

(d) Compute the updated weights

$$w_{e,t} = w_{e,t-1}\, e^{-\eta\tilde{\ell}_{e,t}}, \qquad \overline{w}_{\mathbf{i},t} = \prod_{e\in\mathbf{i}} w_{e,t} = \overline{w}_{\mathbf{i},t-1}\, e^{-\eta\tilde{\ell}_{\mathbf{i},t}},$$

and the sum of the total weights of the paths

$$\overline{W}_t = \sum_{\mathbf{i}\in\mathcal{P}}\overline{w}_{\mathbf{i},t}.$$



The performance of the algorithm is analyzed in the next theorem. The proof follows the argu- 
ment of Cesa-Bianchi et al. [9], but we also have to deal with some additional technical difficulties. 
Note that in the theorem we do not assume that all paths between u and v have equal length. 

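Theorem 5 Let $K$ denote the length of the longest path in the graph. For any $\delta \in (0,1)$, parameters
$0 < \varepsilon \le \frac{1}{K}$ and $\eta > 0$ satisfying $\eta \le \varepsilon^2$, and $n \ge \frac{8b}{\varepsilon^2}\ln\frac{4bN}{\delta}$, the performance of the algorithm defined
above can be bounded, with probability at least $1-\delta$, as

$$\widehat{L}_n - \min_{\mathbf{i}\in\mathcal{P}} L_{\mathbf{i},n} \le K\left(\frac{\eta b}{\varepsilon}Kn + \sqrt{\frac{n}{2}\ln\frac{4}{\delta}} + n\varepsilon + \sqrt{2n\varepsilon\ln\frac{4}{\delta}} + \frac{16}{3}b\sqrt{2n\frac{b}{\varepsilon}\ln\frac{4bN}{\delta}}\right) + \frac{\ln N}{\eta}.$$

In particular, choosing

$$\varepsilon = \left(\frac{Kb}{n}\ln\frac{4bN}{\delta}\right)^{1/3} \qquad \text{and} \qquad \eta = \varepsilon^2$$

we obtain

$$\widehat{L}_n - \min_{\mathbf{i}\in\mathcal{P}} L_{\mathbf{i},n} \le 9.1K^2b\left(Kb\ln(4bN/\delta)\right)^{1/3}n^{2/3}.$$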



The theorem is proved using the following two lemmas. The first one is an easy consequence of 
Bernstein's inequality: 

Lemma 7 Under the assumptions of Theorem 5, the probability that the algorithm queries the basis
more than $n\varepsilon + \sqrt{2n\varepsilon\ln\frac{4}{\delta}}$ times is at most $\delta/4$.
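
For a sense of scale, the bound of Lemma 7 on the number of exploration rounds can be evaluated numerically. The helper below is an illustrative sketch; its name is an assumption and it only evaluates the expression $n\varepsilon + \sqrt{2n\varepsilon\ln(4/\delta)}$.

```python
import math


def exploration_round_bound(n: int, eps: float, delta: float) -> float:
    """High-probability bound n*eps + sqrt(2*n*eps*ln(4/delta)) on the number of basis queries."""
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return n * eps + math.sqrt(2.0 * n * eps * math.log(4.0 / delta))
```

For example, with $n = 10000$, $\varepsilon = 0.1$ and $\delta = 0.001$ the expected number of queries is 1000 and the bound is close to 1129.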

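Using the estimated loss of a path $\mathbf{i} \in \mathcal{P}$ given in (22), we can estimate the cumulative loss of
$\mathbf{i}$ up to time $n$ as

$$\tilde{L}_{\mathbf{i},n} = \sum_{t=1}^{n}\tilde{\ell}_{\mathbf{i},t}.$$

The next lemma demonstrates the quality of these estimates.

Lemma 8 Let $0 < \delta < 1$ and assume $n \ge \frac{8b}{\varepsilon}\ln\frac{4bN}{\delta}$. With probability at least
$1-\delta/4$,

$$\sum_{t=1}^{n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\ell_{\mathbf{i},t} - \sum_{t=1}^{n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\tilde{\ell}_{\mathbf{i},t} \le \frac{8}{3}b\sqrt{2n\frac{bK^2}{\varepsilon}\ln\frac{4b}{\delta}}.$$

Furthermore, for any $\mathbf{i} \in \mathcal{P}$, with probability at least $1-\delta/(4N)$,

$$\tilde{L}_{\mathbf{i},n} - L_{\mathbf{i},n} \le \frac{8}{3}b\sqrt{2n\frac{K^2b}{\varepsilon}\ln\frac{4bN}{\delta}}.$$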



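Proof. We may write

$$\sum_{t=1}^{n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\ell_{\mathbf{i},t} - \sum_{t=1}^{n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\tilde{\ell}_{\mathbf{i},t} = \sum_{t=1}^{n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\sum_{j=1}^{b}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\left(\ell_{\mathbf{b}^j,t} - \tilde{\ell}_{\mathbf{b}^j,t}\right)$$

$$= \sum_{j=1}^{b}\sum_{t=1}^{n}\left[\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\left(\ell_{\mathbf{b}^j,t} - \tilde{\ell}_{\mathbf{b}^j,t}\right)\right]$$

$$\stackrel{\mathrm{def}}{=} \sum_{j=1}^{b}\sum_{t=1}^{n} X_{\mathbf{b}^j,t}. \tag{25}$$

Note that for any $\mathbf{b}^j$, $X_{\mathbf{b}^j,t}$, $t = 1, 2, \ldots$ is a martingale difference sequence with respect to $(\mathbf{I}_t, S_t)$,
$t = 1, 2, \ldots$ as $\mathbb{E}_t\tilde{\ell}_{\mathbf{b}^j,t} = \ell_{\mathbf{b}^j,t}$. Also,

$$\mathbb{E}_t\left[X^2_{\mathbf{b}^j,t}\right] \le \left(\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\right)^2\mathbb{E}_t\left[\tilde{\ell}^{\,2}_{\mathbf{b}^j,t}\right] \le \sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\left(\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\right)^2\frac{K^2b}{\varepsilon} \le 4\frac{K^2b}{\varepsilon} \tag{26}$$

and

$$\left|X_{\mathbf{b}^j,t}\right| \le \left|\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\right|\left|\ell_{\mathbf{b}^j,t} - \tilde{\ell}_{\mathbf{b}^j,t}\right| \le \sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\left|\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\right|\frac{Kb}{\varepsilon} \le 2\frac{Kb}{\varepsilon} \tag{27}$$

where the last inequalities in both cases follow from the fact that $B$ is a 2-barycentric spanner.
Then, using Bernstein's inequality for martingale differences (Lemma 9), we have, for any fixed $\mathbf{b}^j$,

$$\mathbb{P}\left[\sum_{t=1}^{n} X_{\mathbf{b}^j,t} \ge \frac{8}{3}\sqrt{2n\frac{bK^2}{\varepsilon}\ln\frac{4b}{\delta}}\right] \le \frac{\delta}{4b} \tag{28}$$

where we used (26), (27) and the assumption of the lemma on $n$. The proof of the first statement
is finished with an application of the union bound and its combination with (25).
For the second statement we use a similar argument, that is,

$$\sum_{t=1}^{n}\left(\tilde{\ell}_{\mathbf{i},t} - \ell_{\mathbf{i},t}\right) = \sum_{j=1}^{b}\alpha^{(\mathbf{i},B)}_{\mathbf{b}^j}\sum_{t=1}^{n}\left(\tilde{\ell}_{\mathbf{b}^j,t} - \ell_{\mathbf{b}^j,t}\right) \le 2\sum_{j=1}^{b}\left|\sum_{t=1}^{n}\left(\tilde{\ell}_{\mathbf{b}^j,t} - \ell_{\mathbf{b}^j,t}\right)\right|. \tag{29}$$

Now applying Lemma 9 for a fixed $\mathbf{b}^j$ we get

$$\mathbb{P}\left[\left|\sum_{t=1}^{n}\left(\tilde{\ell}_{\mathbf{b}^j,t} - \ell_{\mathbf{b}^j,t}\right)\right| \ge \frac{4}{3}\sqrt{2n\frac{K^2b}{\varepsilon}\ln\frac{4bN}{\delta}}\right] \le \frac{\delta}{4bN} \tag{30}$$

because of $\mathbb{E}_t\left[(\tilde{\ell}_{\mathbf{b}^j,t} - \ell_{\mathbf{b}^j,t})^2\right] \le \frac{K^2b}{\varepsilon}$ and $-K \le \tilde{\ell}_{\mathbf{b}^j,t} - \ell_{\mathbf{b}^j,t} \le K\left(\frac{b}{\varepsilon} - 1\right)$. The proof is completed by
applying the union bound to (30) and combining the result with (29). □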



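Proof of Theorem 5. Similarly to earlier proofs, we follow the evolution of the term $\ln\frac{\overline{W}_n}{\overline{W}_0}$. In
the same way as we obtained (5) and (7), we have

$$\ln\frac{\overline{W}_n}{\overline{W}_0} \ge -\eta\min_{\mathbf{i}\in\mathcal{P}}\tilde{L}_{\mathbf{i},n} - \ln N$$

and

$$\ln\frac{\overline{W}_n}{\overline{W}_0} \le \sum_{t=1}^{n}\left(-\eta\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\tilde{\ell}_{\mathbf{i},t} + \frac{\eta^2}{2}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\tilde{\ell}^{\,2}_{\mathbf{i},t}\right).$$

Combining these bounds, we obtain

$$-\min_{\mathbf{i}\in\mathcal{P}}\tilde{L}_{\mathbf{i},n} - \frac{\ln N}{\eta} \le \sum_{t=1}^{n}\left(-\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\tilde{\ell}_{\mathbf{i},t} + \frac{\eta}{2}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\tilde{\ell}^{\,2}_{\mathbf{i},t}\right) \le \left(-1 + \frac{\eta Kb}{\varepsilon}\right)\sum_{t=1}^{n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\tilde{\ell}_{\mathbf{i},t},$$

because $0 \le \tilde{\ell}_{\mathbf{i},t} \le \frac{2Kb}{\varepsilon}$. Applying the results of Lemma 8 and the union bound, we have, with
probability $1-\delta/2$,

$$-\min_{\mathbf{i}\in\mathcal{P}} L_{\mathbf{i},n} - \frac{8}{3}b\sqrt{2n\frac{K^2b}{\varepsilon}\ln\frac{4bN}{\delta}} \le \left(-1 + \frac{\eta Kb}{\varepsilon}\right)\left(\sum_{t=1}^{n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\ell_{\mathbf{i},t} - \frac{8}{3}b\sqrt{2n\frac{K^2b}{\varepsilon}\ln\frac{4b}{\delta}}\right) + \frac{\ln N}{\eta}. \tag{31}$$

Introduce the sets

$$T_n \stackrel{\mathrm{def}}{=} \{t : 1 \le t \le n \text{ and } S_t = 0\} \qquad \text{and} \qquad \overline{T}_n \stackrel{\mathrm{def}}{=} \{t : 1 \le t \le n \text{ and } S_t = 1\}$$

of "exploitation" and "exploration" steps, respectively. Then, by the Hoeffding-Azuma inequality
[21] we obtain that, with probability at least $1-\delta/4$,

$$\sum_{t\in T_n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\ell_{\mathbf{i},t} \ge \sum_{t\in T_n}\ell_{\mathbf{I}_t,t} - \sqrt{\frac{|T_n|K^2}{2}\ln\frac{4}{\delta}}.$$

Note that for the exploration steps $t \in \overline{T}_n$, as the algorithm plays according to a uniform distribution
instead of $p_{\mathbf{i},t}$, we can only use the trivial lower bound zero on the losses, that is,

$$\sum_{t\in\overline{T}_n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\ell_{\mathbf{i},t} \ge \sum_{t\in\overline{T}_n}\ell_{\mathbf{I}_t,t} - K|\overline{T}_n|.$$

The last two inequalities imply

$$\sum_{t=1}^{n}\sum_{\mathbf{i}\in\mathcal{P}} p_{\mathbf{i},t}\ell_{\mathbf{i},t} \ge \widehat{L}_n - \sqrt{\frac{|T_n|K^2}{2}\ln\frac{4}{\delta}} - K|\overline{T}_n|. \tag{32}$$

Then, by (31), (32) and Lemma 7 we obtain, with probability at least $1-\delta$,

$$\widehat{L}_n - \min_{\mathbf{i}\in\mathcal{P}} L_{\mathbf{i},n} \le K\left(\frac{\eta b}{\varepsilon}Kn + \sqrt{\frac{n}{2}\ln\frac{4}{\delta}} + n\varepsilon + \sqrt{2n\varepsilon\ln\frac{4}{\delta}} + \frac{16}{3}b\sqrt{2n\frac{b}{\varepsilon}\ln\frac{4bN}{\delta}}\right) + \frac{\ln N}{\eta}$$

where we used $\widehat{L}_n \le Kn$ and $|T_n| \le n$. Substituting the values of $\varepsilon$ and $\eta$ gives

$$\widehat{L}_n - \min_{\mathbf{i}\in\mathcal{P}} L_{\mathbf{i},n} \le K^2bn\varepsilon + \frac{1}{4}Kn\varepsilon + Kn\varepsilon + \frac{1}{2}Kn\varepsilon + \frac{16}{3}b\sqrt{K}\,n\varepsilon + n\varepsilon \le 9.1K^2bn\varepsilon$$

where we used $\sqrt{\frac{n}{2}\ln\frac{4}{\delta}} \le \frac{1}{4}n\varepsilon$, $\sqrt{2n\varepsilon\ln\frac{4}{\delta}} \le \frac{1}{2}n\varepsilon$, $\sqrt{n\frac{b}{\varepsilon}\ln\frac{4bN}{\delta}} = n\varepsilon/\sqrt{K}$, and $\frac{\ln N}{\eta} \le n\varepsilon$ (from the
assumptions of the theorem). □

8 Simulation results

To further investigate our new algorithms, we have conducted some simple simulations. As the main
motivation of this work is to improve earlier algorithms in case the number of paths is exponentially
large in the number of edges, we tested the algorithms on the small graph shown in Figure 1(b),
which has one of the simplest structures with exponentially many (namely $2^{|E|/2}$) paths.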

The losses on the edges were generated by a sequence of independent and uniform random
variables, with values from $[0, 1]$ on the upper edges, and from $[0.32, 1]$ on the lower edges, resulting
in a (long-term) optimal path consisting of the upper edges. We ran the tests for $n = 10000$ steps,
with confidence value $\delta = 0.001$. To establish baseline performance, we also tested the EXP3
algorithm of Auer et al. [1] (note that this algorithm does not need edge losses, only the loss of the
chosen path). For the version of our bandit algorithm that is informed of the individual edge losses
(edge-bandit), we used the simple 2-element cover set of the paths consisting of the upper and
lower edges, respectively (other 2-element cover sets give similar performance). For our restricted
shortest path algorithm (path-bandit) the basis {uuuuu, uuuul, uuull, uulll, ullll} was used, where
u (resp. l) in the $k$th position denotes the upper (resp. lower) edge connecting $v_{k-1}$ and $v_k$. In
this example the performance of the algorithm appeared to be independent of the actual choice of
the basis; however, in general we do not expect this behavior. Two versions of the algorithm of
Awerbuch and Kleinberg [4] were also simulated. With its original parameter setting (AwKl), the
algorithm did not perform well. However, after optimizing its parameters off-line (AwKl tuned),
substantially better performance was achieved. The normalized regret of the above algorithms,
averaged over 30 runs, as well as the regret of the fixed paths in the graph are shown in Figure 2.
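
The loss model of this experiment is simple enough to sketch. The code below draws one round of edge losses for a graph with five stages, each having an upper and a lower edge, and evaluates the loss of a path written in the u/l notation used above. The function names, the tuple layout, and the use of Python's `random` module are assumptions for illustration; this is not the authors' simulation code.

```python
import random


def draw_edge_losses(num_stages=5, rng=None):
    """One round of losses: (upper, lower) per stage, upper ~ U[0, 1], lower ~ U[0.32, 1]."""
    rng = rng or random.Random()
    return [(rng.uniform(0.0, 1.0), rng.uniform(0.32, 1.0)) for _ in range(num_stages)]


def path_loss(losses, path):
    """Loss of a path given as a string over {'u', 'l'}, e.g. 'uulll'."""
    if len(path) != len(losses):
        raise ValueError("path must pick one edge per stage")
    return sum(up if choice == "u" else low for (up, low), choice in zip(losses, path))
```

Under this model the expected per-round loss of the all-upper path is 2.5, while the all-lower path has expected loss 3.3, which is why the upper path is optimal in the long run.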

Although all algorithms showed better performance than the bound for the edge-bandit algo- 
rithm, the latter showed the expected superior performance in the simulations. Furthermore, our 
algorithm for the restricted shortest path problem outperformed Awerbuch and Kleinberg's (AwKl) 
algorithm, while being inferior to its off-line tuned version (AwKl tuned). It must be noted that 
similar parameter optimization did not improve the performance of our path-bandit algorithm, 
which showed robust behavior with respect to parameter tuning. 

9 Conclusions 

We considered different versions of the on-line shortest path problem with limited feedback. These 
problems are motivated by realistic scenarios, such as routing in communication networks, where 
the vertices do not have all the information about the state of the network. We have addressed the 
problem in the adversarial setting where the edge losses may vary in an arbitrary way; in particular, 
they may depend on previous routing decisions of the algorithm. Although this assumption may 
neglect natural correlation in the loss sequence, it suits applications in mobile ad-hoc networks, 
where the network topology changes dynamically in time, and also in certain secure networks that
have to be able to handle denial of service attacks.

Efficient algorithms have been provided for the multi-armed bandit setting and in a combined
label efficient multi-armed bandit setting, provided the individual edge losses along the chosen
path are revealed to the algorithms. The normalized regrets of the algorithms, compared to the
performance of the best fixed path, converge to zero at an $O(1/\sqrt{n})$ rate as the time horizon $n$
grows to infinity, and increase only polynomially in the number of edges (and vertices) of the
graph. Earlier methods for the multi-armed bandit problem either do not have the right $O(1/\sqrt{n})$
convergence rate, or their regret increases exponentially in the number of edges for typical graphs.
The algorithm has also been extended so that it can compete with time-varying paths, that is, to
handle situations when the best path can change from time to time (for consistency, the number
of changes must be sublinear in $n$).

Figure 2: Normalized regret of several algorithms for the shortest path problem. The gray dotted
lines show the normalized regret of fixed paths in the graph.

In the restricted version of the shortest path problem, where only the losses of the whole paths
are revealed, an algorithm with a worse $O(n^{-1/3})$ normalized regret was provided. This algorithm
has comparable performance to that of the best earlier algorithm for this problem [4]; however,
our algorithm is significantly simpler. Simulation results are also given to assess the practical
performance and compare it to the theoretical bounds as well as other competing algorithms.

It should be noted that the results are not entirely satisfactory in the restricted version of the
problem, as it remains an open question whether the $O(1/\sqrt{n})$ regret can be achieved without the
exponential dependence on the size of the graph. Although we expect that this is the case, we have
not been able to construct an algorithm with such a proven performance bound.

10 Appendix 

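Lemma 9 (Bernstein's inequality for martingale differences [10].) Let $X_1,\ldots,X_n$ be a martingale
difference sequence such that $X_t \in [a,b]$ with probability one ($t = 1,\ldots,n$). Assume that, for all $t$,

$$\mathbb{E}\left[X_t^2 \mid X_{t-1},\ldots,X_1\right] \le \sigma^2 \quad \text{a.s.}$$

Then, for all $\epsilon > 0$,

$$\mathbb{P}\left\{\sum_{t=1}^{n} X_t > \epsilon\right\} \le \exp\left(-\frac{\epsilon^2}{2n\sigma^2 + 2\epsilon(b-a)/3}\right)$$

and therefore

$$\mathbb{P}\left\{\sum_{t=1}^{n} X_t > \sqrt{2n\sigma^2\ln\delta^{-1}} + 2\ln\delta^{-1}(b-a)/3\right\} \le \delta.$$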
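
The second form of the inequality follows from the first by plugging in the stated deviation, and this can be checked numerically. The sketch below evaluates both expressions of the lemma; the function names are illustrative assumptions.

```python
import math


def bernstein_tail(n, sigma2, a, b, eps):
    """Tail bound exp(-eps^2 / (2 n sigma^2 + 2 eps (b - a) / 3)) from the first display."""
    return math.exp(-eps ** 2 / (2.0 * n * sigma2 + 2.0 * eps * (b - a) / 3.0))


def bernstein_deviation(n, sigma2, a, b, delta):
    """Deviation sqrt(2 n sigma^2 ln(1/delta)) + 2 ln(1/delta) (b - a) / 3 from the second display."""
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * n * sigma2 * log_term) + 2.0 * log_term * (b - a) / 3.0
```

Evaluating `bernstein_tail` at the value returned by `bernstein_deviation` gives a probability no larger than `delta`, in agreement with the lemma.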


